### For this assignment we were asked to complete 5 tasks using the spaCy tool

Read the [spaCy documentation](https://spacy.io/api/dependencyparser) on the dependency parser to learn the methods it provides.
We start by importing spaCy and loading the pipeline.

I tested my functions with a common sentence often used during classes: ```"I saw a man with a telescope."```

In [1]:
import spacy
from spacy import displacy

txt = "I saw a man with a telescope."

nlp = spacy.load('en_core_web_sm')

## 1° TASK: Extract a path of dependency relations from the ROOT to a token
    - input is a sentence, you parse it and get a Doc object of spaCy
    - for each token the path will be a list of dependency relations, where first element is ROOT

**For task 1, I defined the function** ```getDependecyPath(sentence)```

INPUT: the sentence (type string) to parse

OUTPUT: a list of paths, one per token; each path lists the dependency relations from ROOT down to that token.

I used ```token.sent.root``` to find the root of the sentence and ```token.dep_``` to read the dependency relation between each token and its parent.

These let me follow the arcs from the token up to the root. Each step is inserted at the front of the list, so the path ends up ordered with ROOT in first place and there is no need to reverse it.
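The climb from a token to the root can be checked without loading a model. The sketch below is only an illustration: the `WORDS`, `HEADS` and `DEPS` lists are hand-written assumptions that mirror the parse of the test sentence reported in this notebook (with the root pointing to itself, as spaCy does), and `path_from_root` is a hypothetical helper, not part of spaCy.

```python
# Hand-written parse of "I saw a man with a telescope." (an assumption that
# mirrors the parse reported in this notebook, not live parser output).
WORDS = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
HEADS = [1, 1, 3, 1, 3, 6, 4, 1]  # index of each token's parent; the root points to itself
DEPS = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj", "punct"]


def path_from_root(index):
    """Return [(relation, word), ...] from the root down to the token at `index`."""
    chain = []
    while HEADS[index] != index:  # stop once the root is reached
        chain.append((DEPS[index], WORDS[index]))
        index = HEADS[index]  # move one arc up, to the parent
    chain.append((DEPS[index], WORDS[index]))  # the root itself
    chain.reverse()  # steps were collected bottom-up, so flip them
    return chain


print(path_from_root(6))
# [('ROOT', 'saw'), ('dobj', 'man'), ('prep', 'with'), ('pobj', 'telescope')]
```

The important step is `index = HEADS[index]`: moving to the parent one arc at a time is what keeps the intermediate tokens (`man`, `with`) in the path.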


In [2]:
#Function 1--------------------------------------------------

def getDependecyPath(sentence):

    doc = nlp(sentence)  # Build a Doc object; the usual way to get one is via the nlp object.
    path = []

    for token in doc:
        tempList = []

        # Climb from the token to the sentence root, one arc at a time.
        while token != token.sent.root:
            tempList.insert(0, token.text)
            tempList.insert(0, ' ----('+token.dep_+')---> ')
            token = token.head  # move to the parent, not straight to the root

        # token is now the root: add it together with the ROOT marker.
        tempList.insert(0, token.text)
        tempList.insert(0, ' ----('+token.dep_+')---> ')
        tempList.insert(0, 'ROOT')
        path.insert(0, tempList)  # paths end up in reverse token order

    return path

The structure of each path is shown below:

```['ROOT', '----first dependency relation (ROOT)--->', 'Token', '----second dependency relation--->', 'Token']```

In [3]:
#Test Function 1

dependencyPath = getDependecyPath(txt)

for dp in dependencyPath:
    print(dp)

['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(punct)---> ', '.']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(dobj)---> ', 'man', ' ----(prep)---> ', 'with', ' ----(pobj)---> ', 'telescope']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(dobj)---> ', 'man', ' ----(prep)---> ', 'with', ' ----(pobj)---> ', 'telescope', ' ----(det)---> ', 'a']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(dobj)---> ', 'man', ' ----(prep)---> ', 'with']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(dobj)---> ', 'man']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(dobj)---> ', 'man', ' ----(det)---> ', 'a']
['ROOT', ' ----(ROOT)---> ', 'saw']
['ROOT', ' ----(ROOT)---> ', 'saw', ' ----(nsubj)---> ', 'I']


displaCy is spaCy's built-in visualizer for dependency parses and named entities; here I used it as a dependency visualizer. displaCy can usually detect that it is running in a Jupyter notebook, but I set ```jupyter=True``` to request notebook rendering explicitly. Here's a compact view of the dependency tree.

In [11]:
doc = nlp(txt)
displacy.render(doc, style="dep",options={'compact':True},jupyter=True)

## 2° TASK: Extract the subtree of dependents of a given token
    - input is a sentence, you parse it and get a Doc object of spaCy
    - for each token in Doc objects you extract a subtree of its dependents as a list

**For task 2, I defined the function** ```getSubtree(sentence)```

INPUT: the sentence (type string) from which I extract the subtrees

OUTPUT: a dictionary that maps the text of each token in nlp(sentence) to its subtree, converted to a list

The attribute ```token.subtree``` is read for each token in the sentence. It yields the token together with all of its dependents, in sentence order, and the result is cast to a list.

In [20]:
#Function 2--------------------------------------------------

def getSubtree(sentence):

    return {token.text: list(token.subtree) for token in nlp(sentence)}

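To test the function I again gave the txt string as input.

For each token, the subtree (including the token itself) is displayed as requested. Because the dictionary is keyed by the token text, the two occurrences of ```a``` share a single entry.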

In [21]:
#Test Function 2

getSubtree(txt)

{'I': [I],
 'saw': [I, saw, a, man, with, a, telescope, .],
 'a': [a],
 'man': [a, man, with, a, telescope],
 'with': [with, a, telescope],
 'telescope': [a, telescope],
 '.': [.]}
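In the output above the two occurrences of `a` end up under one key, because the dictionary is keyed by token text. A minimal sketch of one way around this is to key the dictionary by position. The parse here is hand-written (an assumption mirroring the subtrees printed above, so the code runs without a model), and `subtree` is a hypothetical helper rather than a spaCy function.

```python
# Hand-written parse of "I saw a man with a telescope." (an assumption that
# mirrors the subtrees printed in this notebook, not live parser output).
WORDS = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
HEADS = [1, 1, 3, 1, 3, 6, 4, 1]  # index of each token's parent; the root points to itself


def subtree(index):
    """Indices of the token at `index` and all its descendants, in sentence order."""
    members = []
    for i in range(len(WORDS)):
        j = i
        # Climb from token i until we meet `index` or reach the root.
        while j != index and HEADS[j] != j:
            j = HEADS[j]
        if j == index:
            members.append(i)
    return members


# Key by (position, word) so that repeated words keep separate entries.
subtrees = {(i, w): [WORDS[k] for k in subtree(i)] for i, w in enumerate(WORDS)}

for key, words in subtrees.items():
    print(key, words)
```

With spaCy itself, the same idea would be to use the token index `token.i` (or the pair `(token.i, token.text)`) as the dictionary key.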

## 3° TASK: check if a given list of tokens (segment of a sentence) forms a subtree
    - you parse a sentence and get a Doc object of spaCy
    - providing as an input ordered list of words, you output True/False based on the sequence forming a subtree or not

**For task 3, I defined the function** ```is_Subtree(sentence, segment)```

INPUT: the sentence (type string) and a segment as a list of tokens

OUTPUT: it returns True if the segment matches, word for word, the subtree of some token in the sentence (each subtree is obtained with ```token.subtree``` and stored in subList), and False otherwise. This follows the definition of a subtree: "A subtree of a tree T is a tree S consisting of a node in T and all of its descendants in T".

Only the sentence is parsed with the spaCy English model; the segment is already a list of tokens, so it does not need parsing.

Note the condition ```len(segment) == 0:```: I treat an empty list as forming a (trivial) subtree.
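The subtree test itself can be sketched on a hand-written parse, which makes the definition easy to check by hand. Everything in this block is an assumption for illustration: the `WORDS` and `HEADS` lists mirror the parse reported in this notebook, and `subtree_words` and `forms_subtree` are hypothetical helpers, not spaCy calls.

```python
# Hand-written parse of "I saw a man with a telescope." (an assumption that
# mirrors the parse reported in this notebook, not live parser output).
WORDS = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
HEADS = [1, 1, 3, 1, 3, 6, 4, 1]  # index of each token's parent; the root points to itself


def subtree_words(index):
    """Words of the token at `index` and all its descendants, in sentence order."""
    words = []
    for i in range(len(WORDS)):
        j = i
        while j != index and HEADS[j] != j:  # climb towards the root
            j = HEADS[j]
        if j == index:
            words.append(WORDS[i])
    return words


def forms_subtree(segment):
    """True if `segment` equals the full subtree of some token (or is empty)."""
    if not segment:
        return True  # convention used in this notebook: empty list counts as a subtree
    return any(subtree_words(i) == segment for i in range(len(WORDS)))


print(forms_subtree(["a", "telescope"]))  # True
print(forms_subtree(["man", "with"]))     # False
```

`["man", "with"]` is rejected even though both words are connected in the tree, because a subtree must contain a node together with all of its descendants.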

In [None]:
#Function 3--------------------------------------------------

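def is_Subtree(sentence, segment):
    doc = nlp(sentence)

    # An empty segment is treated as a trivial subtree.
    if len(segment) == 0:
        return True

    # The segment forms a subtree if it equals the subtree of some token.
    for token in doc:
        subList = token.subtree
        if [tok.text for tok in subList] == segment:
            return True

    return False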

In this test I gave 3 different lists of tokens as input.

The ANSI bold sequence ```'\033[1m' + txt + '\033[0m'``` makes the printed sentence easier to read.

In [89]:
#Test Function 3

segment1 = ['a', 'telescope']
segment2 = ['with', 'the', 'telescope']
segment3 = []

# Run the same check on every segment instead of repeating the if/else block.
for segment in (segment1, segment2, segment3):
    if is_Subtree(txt, segment):
        print('The list of token', segment, 'is a subtree of ', '\033[1m' + txt + '\033[0m')
    else:
        print('The list of token', segment, 'is not a subtree of ', '\033[1m' + txt + '\033[0m')

The list of token ['a', 'telescope'] is a subtree of  I saw a man with a telescope.
The list of token ['with', 'the', 'telescope'] is not a subtree of  I saw a man with a telescope.
The list of token [] is a subtree of  I saw a man with a telescope.


## 4° TASK: Identify head of a span, given its tokens
    - input is a sequence of words (not necessarily a sentence)
    - output is the head of the span (single word)

**For task 4, I defined the function** ```getSpanHead(sentence)```

INPUT: the sentence (type string).

It is also possible to take as input a Doc that has already been parsed; in that case it only needs to be converted to a span, and the head/root is extracted with ```span.root```. If the input is already a span, its root can be returned directly.

OUTPUT: it returns the token corresponding to the root of the span (a "slice" of a sentence).
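To make the notion of a span head concrete, here is a small sketch on a hand-written parse. As I understand the spaCy documentation, `Span.root` is the token of the span with the shortest path to the root of the sentence, with ties going to the first token; the code below follows that description. The `WORDS` and `HEADS` lists are assumptions mirroring the parse of the test sentence in this notebook, and `depth` and `span_root` are hypothetical helpers.

```python
# Hand-written parse of "I saw a man with a telescope." (an assumption that
# mirrors the parse reported in this notebook, not live parser output).
WORDS = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
HEADS = [1, 1, 3, 1, 3, 6, 4, 1]  # index of each token's parent; the root points to itself


def depth(index):
    """Number of arcs between the token at `index` and the sentence root."""
    steps = 0
    while HEADS[index] != index:
        index = HEADS[index]
        steps += 1
    return steps


def span_root(start, end):
    """Index of the highest token in WORDS[start:end]; ties go to the first one."""
    return min(range(start, end), key=depth)


print(WORDS[span_root(4, 7)])  # head of "with a telescope" -> with
print(WORDS[span_root(2, 4)])  # head of "a man" -> man
```

The head of a span does not have to be the root of the sentence: for `with a telescope` it is `with`, whose own parent (`man`) lies outside the span.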

In [119]:
#Function 4--------------------------------------------------

def getSpanHead(sentence):
    doc = nlp(sentence)
    span = doc[:]
    
    return span.root

I tested my function with a new string, txt1, and with a span built via ```Doc.__getitem__``` (slicing), which in the example contains tokens 1, 2 and 3. For the span, the head is read directly with ```span.root```.

In [132]:
#Test Function 4

txt1 = 'A book of old english ballads'
doc = nlp(txt1)
span = doc[1:4]

print('The head of "', '\033[1m' + txt1 + '\033[0m', '" is', getSpanHead(txt1))
print('The head of "', span, '"is', span.root)  # root of the span itself, not of the whole sentence

The head of " A book of old english ballads " is book
The head of " book of old "is book


## 5° TASK: Extract sentence subject, direct object and indirect object spans
    - input is a sentence, you parse it and get a Doc object of spaCy
    - output is lists of words that form a span for subject, direct object, and indirect obj (if present, otherwise empty)
    - dict of lists, is better

**For task 5, I defined the function** ```getSentenceDep(sentence)```

INPUT: the sentence (type string), which is parsed as in the previous tasks

OUTPUT: it returns a dictionary with one list of words for each category: subjects, direct objects and indirect objects

The logic of this function is based on subtrees. It iterates over all the tokens in the sentence and compares each ```.dep_``` label with a set of predefined strings; when a label matches, ```token.subtree``` gives the sequence containing the token and all of the token’s syntactic descendants, which forms the span.

To categorize "subjects" I checked for ```nsubj``` (nominal subject), ```nsubjpass``` (nominal subject of a passive verb), ```csubj``` (clausal subject, when the subject is itself a clause) and ```csubjpass``` (clausal subject of a passive verb).

For direct objects I only considered ```dobj```, and for indirect objects the labels ```iobj``` and ```dative``` (the recipient or beneficiary of the action).
**spaCy from version 3.1.0** replaced iobj with dative, so I decided to check both in case an older version is used.

As each token is analysed, the words of its subtree are appended to the right category; once all tokens have been processed, the dictionary is returned.
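The label-to-category logic can also be written as a lookup table. The sketch below is a variation, not the function used in this notebook: it keeps one list per matching token, so that a sentence with two subjects would keep their spans separate. The parse is hand-written (an assumption mirroring the parse of the test sentence), and `CATEGORY`, `subtree_words` and `sentence_spans` are hypothetical names.

```python
# Hand-written parse of "I saw a man with a telescope." (an assumption that
# mirrors the parse reported in this notebook, not live parser output).
WORDS = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
HEADS = [1, 1, 3, 1, 3, 6, 4, 1]  # index of each token's parent; the root points to itself
DEPS = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj", "punct"]

# Map each dependency label to the category it belongs to.
CATEGORY = {
    "nsubj": "subjects", "nsubjpass": "subjects",
    "csubj": "subjects", "csubjpass": "subjects",
    "dobj": "direct objects",
    "iobj": "indirect objects", "dative": "indirect objects",
}


def subtree_words(index):
    """Words of the token at `index` and all its descendants, in sentence order."""
    words = []
    for i in range(len(WORDS)):
        j = i
        while j != index and HEADS[j] != j:  # climb towards the root
            j = HEADS[j]
        if j == index:
            words.append(WORDS[i])
    return words


def sentence_spans():
    """Group the spans of subjects, direct objects and indirect objects."""
    spans = {"subjects": [], "direct objects": [], "indirect objects": []}
    for i, dep in enumerate(DEPS):
        if dep in CATEGORY:
            spans[CATEGORY[dep]].append(subtree_words(i))  # one list per span
    return spans


print(sentence_spans())
```

Adding or removing a label then only means editing the `CATEGORY` table, without touching the loop.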

In [150]:
#Function 5--------------------------------------------------

def getSentenceDep(sentence):
    doc = nlp(sentence)

    # Named `spans` rather than `dict`, which would shadow the built-in type.
    spans = {
        'subjects':[], 'direct objects':[], 'indirect objects':[]
    }

    for token in doc:

        if token.dep_ in ('nsubj', 'nsubjpass', 'csubj', 'csubjpass'):
            # Subject: collect the token and all of its descendants.
            for desc in token.subtree:
                spans["subjects"].append(desc.text)

        elif token.dep_ == 'dobj':
            for desc in token.subtree:
                spans["direct objects"].append(desc.text)

        elif token.dep_ in ('dative', 'iobj'):
            for desc in token.subtree:
                spans["indirect objects"].append(desc.text)

    return spans
    


In [159]:
#Test Function 5

print("Subjects, direct objects and indirect objects of the sentence", '\033[1m' + txt + '\033[0m', "are: \n", getSentenceDep(txt))

Subjects, direct objects and indirect objects of the sentence I saw a man with a telescope. are: 
 {'subjects': ['I'], 'direct objects': ['a', 'man', 'with', 'a', 'telescope'], 'indirect objects': []}
