# Parsing Assignment (M4LP)

This assignment covers dependency parsing. Use a combination of what you learned in class, the instructions, what you already know about coding, and the linked documentation to solve the problems. Please don't use Gemini or other AI tools -- they don't work well for this assignment, and will keep you from learning. You can turn them off in Google Colab under Settings.

The assignment has 76 points total, and is worth about 4% of your final grade. Components of this assignment will inspire questions on your midterm.

Please fill in the cell below with your group number and names and who did what.

### Submission

You will generate a number of files in this assignment, but you should just submit your version of this Notebook.

### Environment setup

Run all this code when you start up the notebook to make sure you have everything you need.

original by L.abzianidze@uu.nl, updated by m.fowlie@uu.nl

### Group info

Group number: 8

Group members: Bjorn Klaassen, Noah de Jonge

Who contributed to which exercises (you don't need to be very detailed):

exercises done together: Bjorn: , Noah:

# Environment setup

## Installation

Import spaCy and download its model. Install Stanza, which comes with an interface for CoreNLP. Download CoreNLP. Install the modules needed to render NLTK's syntactic trees. Download a course-specific Python package that contains useful tools.

Additionally, you might find the following Python features handy: [isinstance](https://www.programiz.com/python-programming/methods/built-in/isinstance), [list comprehension](https://www.programiz.com/python-programming/list-comprehension), [f-string](https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/)

In [None]:
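import spacy
# compare version numbers as integers; as strings, '3.10' would sort before '3.8'
if tuple(int(part) for part in spacy.__version__.split('.')[:2]) < (3, 8):
    print(f"spaCy v={spacy.__version__} but it should be >= 3.8\nForce install 3.8 with the next cell")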

In [None]:
# may require environment restart
# if necessary, uncomment below and run:
# !pip install "spacy>=3.8"

In [None]:
# install medium-sized English model
!python -m spacy download en_core_web_md

In [None]:
!pip install stanza

In [None]:
import stanza  # may require environment restart
# if necessary, uncomment below and run:
# !pip install "stanza>=1.5.0"
# compare version numbers as integers; as strings, '1.10.0' would sort before '1.5.0'
if tuple(int(part) for part in stanza.__version__.split('.')[:2]) < (1, 5):
    print(f"WARNING: stanza v={stanza.__version__}. It should be at least 1.5.0\nIf necessary, force install more recent version with the next cell")

In [None]:
# may require environment restart
# if necessary, uncomment below and run:
# !pip install "stanza>=1.5.0"

In [None]:
import os
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)
# Set the CORENLP_HOME environment variable to point to the installation location
os.environ["CORENLP_HOME"] = corenlp_dir
# Import client module
from stanza.server import CoreNLPClient
# src: https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb

In [None]:
# Needed to display NLTK's trees objects
!pip install svgling

In [None]:
# assigntools package is a course specific collection of useful tools
! rm -rf assigntools
! git clone https://github.com/megodoonch/assigntools.git

## Import

In [None]:
import numpy as np
import os, sys
import nltk
from nltk.tree import Tree
from IPython.display import display
from spacy import displacy
import importlib
from typing import List, Iterable
from collections import defaultdict

In [None]:
# Course-specific package
from assigntools.M4LP.A1 import read_pickle, write_pickle, download_extract_zip, flatten_list, display_doc_dep

In [None]:
# TEST
print(f"spaCy version: {spacy.__version__}")    # should be >= 3.8
print(f"Python version: {sys.version}")
print(f"NLTK version: {nltk.__version__}")
print(f"stanza version: {stanza.__version__}") # should be >= 1.5.0

## Download

In [None]:
# read-only
# URL of a file that will be used during the assignment
SICK_TRIAL_URL = "http://alt.qcri.org/semeval2014/task1/data/uploads/sick_trial.zip"
files = download_extract_zip(SICK_TRIAL_URL)

## Students' additional modules

If you need any additional modules, import them here.

In [None]:
# IMPORT ALL ADDED AND NECESSARY MODULES HERE (IF ANY)
import csv


## Ex1[10pts]: Extracting sentences

When parsing many sentences, it is good practice to parse each distinct sentence only once to reduce parsing time. The number of sentences in this exercise is small, so the time saved will only be about 2-3 minutes, but in real applications such tricks can save hours.

Complete the following function so that it does what it says in the docstring. Make sure the "..." is removed and that you include comments that explain your code. Feel free to update the docstring as well if you like.

The file the function is supposed to read is a tab-separated-value file. You can use string operations or regex to read the sentences, but the best practice is to use existing modules that provide file readers for common file formats.

`SICK_trial.txt` has the format your function needs to be able to extract sentences from. Extract all sentences.
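
As a minimal sketch of the "existing module" approach, the snippet below reads a small in-memory TSV sample with Python's `csv` module. The column names (`pair_ID`, `sentence_A`, `sentence_B`, ...) and the sample rows are assumptions made for illustration, and `unique_sentences` is a hypothetical helper; check the header of `SICK_trial.txt` before relying on any of them.

```python
import csv
import io

# Hypothetical sample in the assumed SICK layout (tab-separated, one header row).
sample_tsv = (
    "pair_ID\tsentence_A\tsentence_B\trelatedness_score\tentailment_judgment\n"
    "1\tA dog is running\tA dog is sleeping\t3.1\tCONTRADICTION\n"
    "2\tA dog is running\tA cat is running\t3.4\tNEUTRAL\n"
)

def unique_sentences(handle):
    """Collect the distinct sentences from both sentence columns, sorted."""
    reader = csv.DictReader(handle, delimiter="\t")  # first row becomes the keys
    seen = set()                                     # a set drops duplicates
    for row in reader:
        seen.add(row["sentence_A"])
        seen.add(row["sentence_B"])
    return sorted(seen)

sentences = unique_sentences(io.StringIO(sample_tsv))
print(sentences)
```

Reading by column name rather than by position keeps the code readable, at the cost of depending on the exact header spelling.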



In [None]:
################################################################################
################################## EXERCISE 1 ##################################
################################################################################

def read_tsv_sentences(file_path: str) -> List[str]:
    """
    Takes the path of a tab-separated-value file and reads all sentences from it
    File is formatted as in SICK_trial.txt.
    Return a list of str (the sentences) that is sorted (in ascending order)
        and duplicate ones are filtered out.
    """
    # a set drops duplicate sentences automatically
    sentences = set()
    # newline='' is what the csv module recommends for files passed to a reader
    with open(file_path, newline='', encoding='utf-8') as file:
        tab_reader = csv.reader(file, delimiter='\t')
        next(tab_reader)  # skip the header row
        for row in tab_reader:
            # columns 1 and 2 hold the two sentences of each pair
            sentences.add(row[1])
            sentences.add(row[2])
    return sorted(sentences)

In [None]:
# A toy data set: 3 sentences from the SICK trial. Useful for testing function behaviour quickly, and even if EX1 isn't done yet.

toy_sick = ['A baby is playing with a doll', 'A baby is playing with a toy', 'A baby tiger is playing with a ball']

In [None]:
# TEST EX1
sents = read_tsv_sentences('SICK_trial.txt')
print(sents[0])
assert sents[:3] == toy_sick, f"first 3 sentences are incorrect"

# Ex2[10pt] and Ex3[1pt]: Parsing and tagging with spaCy

Now it is time to parse sentences. We will use [spaCy](https://spacy.io/) to get dependency trees for the sentences. In addition to dependency parsing, the spaCy pipeline also does part-of-speech tagging and lemmatization, among other things. In this exercise, we write and read spaCy parses to and from CONLL files.

For a quick intro to spaCy, have a look at the following sections of the [spaCy tutorial](https://course.spacy.io/en/): sections 1 and 5 in [chapter 1](https://course.spacy.io/en/chapter1), and section 4 in [chapter 2](https://course.spacy.io/en/chapter2).
Use attributes of spaCy's [Token objects](https://spacy.io/api/token).
After annotation, tokens come with two pos tags: fine-grained corresponds to [Penn Treebank pos tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) while coarse-grained to [Universal pos tags](https://universaldependencies.org/u/pos/).

CONLL is a common file format for dependencies. Take a look at [the format for Universal Dependencies](https://universaldependencies.org/format.html) as an example. In this exercise, you'll familiarise yourself with the spaCy `Doc` object that the parser generates by writing a list of them to a CONLL file with a particular format, as given in the docstring. You can also see exactly what it should write for a toy corpus below.

Ex2: Complete the `spacy2conll` function so that it does what it says in the docstring. Ex3: Test it by writing the toy dataset below, `toy_parsed`, to file. Manually inspect the file `toy_spacy_sm.conll`. It should look like this:

```
# sent_id = 1
# text = A baby is playing with a doll
0	A	a	DET	DT	1	det
1	baby	baby	NOUN	NN	3	nsubj
2	is	be	AUX	VBZ	3	aux
3	playing	play	VERB	VBG	3	ROOT
4	with	with	ADP	IN	3	prep
5	a	a	DET	DT	6	det
6	doll	doll	NOUN	NN	4	pobj

# sent_id = 2
# text = A baby is playing with a toy
0	A	a	DET	DT	1	det
1	baby	baby	NOUN	NN	3	nsubj
2	is	be	AUX	VBZ	3	aux
3	playing	play	VERB	VBG	3	ROOT
4	with	with	ADP	IN	3	prep
5	a	a	DET	DT	6	det
6	toy	toy	NOUN	NN	4	pobj

# sent_id = 3
# text = A baby tiger is playing with a ball
0	A	a	DET	DT	2	det
1	baby	baby	NOUN	NN	2	compound
2	tiger	tiger	NOUN	NN	4	nsubj
3	is	be	AUX	VBZ	4	aux
4	playing	play	VERB	VBG	4	ROOT
5	with	with	ADP	IN	4	prep
6	a	a	DET	DT	7	det
7	ball	ball	NOUN	NN	5	pobj

```

## Parse and display a corpus

In [None]:
# parsing sentences with spaCy's small model

# load the small model
nlp_sm = spacy.load("en_core_web_sm")

In [None]:
# Parse the toy dataset
toy_parsed = nlp_sm.pipe(toy_sick)
# convert it to a list
toy_parsed = list(toy_parsed)

In [None]:
# Display spaCy's dependency trees with the help of displaCy
for parse in toy_parsed:
    display_doc_dep(parse)

In [None]:
# we can regulate space between tokens, but it might affect readability of labels
display_doc_dep(toy_parsed[0], d=100)

In [None]:
################################################################################
################################## EXERCISE 2 ##################################
################################################################################

def spacy2conll(parsed_sentences, out_path: str):
    """
    Given an iterable over SpaCy parses (e.g. a list), write a CONLL file,
    formatted as follows:
    Elements are tab-separated.
    Word indices are 0-indexed.
    Sentence IDs are the 1-indexed indices from the input
    Add a newline between sentences.
    Format:
    # sent id n
    # text = This is the sentence.
    0 word lemma coarse_pos fine_pos head_index dep_label
    1 word ...
    """
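    with open(out_path, "w") as file:
        for sent_id, doc in enumerate(parsed_sentences, start=1):
            # header comments: 1-indexed sentence ID and the raw text
            file.write(f"# sent id {sent_id}\n")
            file.write(f"# text = {doc.text}\n")
            for token in doc:
                # one tab-separated line per token (a root's head is its own index)
                file.write(f"{token.i}\t{token.text}\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t{token.head.i}\t{token.dep_}\n")
            # a single blank line closes the sentence
            file.write("\n")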


In [None]:
################################################################################
################################## EXERCISE 3 ##################################
################################################################################

# Use spacy2conll to write the parsed toy corpus to toy_spacy_sm.conll
spacy2conll(toy_parsed, "toy_spacy_sm.conll")

## Ex4 [10pts] and Ex5 [1pt]

Now reverse the process. Write a function to read in a file like the one generated in Ex 2 above, and return a list of `spacy.tokens.Doc` objects.

A few notes:

* The `vocab` argument is needed to initialise a `Doc`. When you run this function, use the `vocab` from the model used to parse it. For the small model loaded in this Notebook, that's `nlp_sm.vocab`.
* There's a bug in `Doc.__init__` that adds an extra space at the end of `Doc.text` when you create it directly with the `__init__` function instead of by parsing with the model. To help you check your work, we provide `check_doc_equality` that checks all attributes that are included in the CONLL file.
* You'll probably find the [documentation](https://spacy.io/api/doc) for `Doc` helpful.

Then test your function by reading back in your conll file and checking the Docs against the original parses (Ex5).
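
Before any `Doc` is built, the file has to be split back into per-sentence columns. The sketch below shows one way to do that step with plain Python. It is not the required solution: `parse_conll_text` and `FIELDS` are hypothetical names, the sample text is hand-written, and the sketch assumes that every comment line starts with `#` and that sentences are separated by a blank line.

```python
FIELDS = ("words", "lemmas", "pos", "tags", "heads", "deps")

def parse_conll_text(text):
    """Split CONLL-style text into one dict of column lists per sentence."""
    sentences, current = [], None
    for line in text.splitlines():
        if line.startswith("#"):
            continue                      # metadata lines carry no token info
        if not line.strip():
            if current:                   # a blank line ends the current sentence
                sentences.append(current)
            current = None
            continue
        if current is None:
            current = {name: [] for name in FIELDS}
        # the leading token index is implied by position, so it is not stored
        _index, word, lemma, pos, tag, head, dep = line.split("\t")
        for name, value in zip(FIELDS, (word, lemma, pos, tag, int(head), dep)):
            current[name].append(value)
    if current:                           # file may end without a blank line
        sentences.append(current)
    return sentences

# Hand-written sample in the layout described for Ex2.
sample = (
    "# sent id 1\n"
    "# text = A baby is playing\n"
    "0\tA\ta\tDET\tDT\t1\tdet\n"
    "1\tbaby\tbaby\tNOUN\tNN\t3\tnsubj\n"
    "2\tis\tbe\tAUX\tVBZ\t3\taux\n"
    "3\tplaying\tplay\tVERB\tVBG\t3\tROOT\n"
    "\n"
)
parsed = parse_conll_text(sample)
print(parsed[0]["words"], parsed[0]["heads"])
```

The list names were chosen to line up with keyword arguments of `Doc`; consult the `Doc` documentation for how each argument, `heads` in particular, is interpreted.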

In [None]:
################################################################################
################################## EXERCISE 4 ##################################
################################################################################

def conll2spacy(input_path, vocab):
    """
    Given a path to a CONLL file and a SpaCy Vocab object,
    read in the conll file and return a list of SpaCy Docs
     containing the same information.
    @param input_path: str: path to conll file
    @param vocab: spaCy Vocab object: the vocab of the model used to parse the sentence
    @return: list of Docs
    """
    ...

In [None]:
# Provided

def check_doc_equality(doc1, doc2):
    """
    A bug in the spaCy Doc initialiser adds a space to the end of the
     doc.text when built from components, rather than from the parser.
    To get around it, check equality of Doc objects with this function.
    """
    if len(doc1) != len(doc2):
        return False
    for token1, token2 in zip(doc1, doc2):
        if token1.text != token2.text: return False
        if token1.lemma != token2.lemma: return False
        if token1.tag != token2.tag: return False
        if token1.pos != token2.pos: return False
        if token1.head.i != token2.head.i: return False
        if token1.dep != token2.dep: return False
    return True

In [None]:
################################################################################
################################## EXERCISE 5 ##################################
################################################################################

# Read in the conll file you wrote and check whether each of the 3 entries
# is correct. Use check_doc_equality.

# Display the original and re-read parses with display_doc_dep
# and visually inspect them

## Ex6 [3pts]: Parse the real corpus

Perform the following tasks. Print messages to the screen as appropriate.

1. Parse the full SICK trial corpus with the `nlp_sm` model
2. Write the parses to a conll file `full_spacy_sm.conll`
3. Read them back in
4. Check that the re-read `Doc`s are the same as the originals
5. Visually inspect three parses (not the first three)
6. Parse the full corpus with the medium English model (see next cell)
7. Write the medium-parsed corpus to `full_spacy_md.conll`

Note: Because this is a small corpus and a fast parser, you can choose, when you need the parsed corpus later, whether to read it in from file or just use what's in the memory from the parsing earlier in the Notebook. If you're finding any of this slow, you may want to download the conll files to your own device, so you don't have to re-parse in future.

In [None]:
# parsing all sentences with spaCy's medium model

# load the medium model
nlp_md = spacy.load("en_core_web_md")


## Ex7[10pt] and Ex8[1pt]: Projectivity

Use spaCy's [Token attributes or methods](https://spacy.io/api/token) related to dependency annotations. This will make your code much simpler.

As you learned in class, not all dependency trees are projective (see example below). Although spaCy uses an Arc-Eager algorithm, [it has some additional functionality](https://spacy.io/api/dependencyparser/) which makes non-projective parses possible as well.

Use the definition in the lecture (and textbook) to complete `is_projective` so that it checks the projectivity of a given spaCy `Doc`.

Sentence 689, at least in spaCy 3.8.11, has a non-projective parse from the small model and a projective parse from the medium model. To get a look at a non-projective and a projective tree, you can use `display_doc_dep`. You may want the optional argument `compact=False` to make curved arcs.
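
To make the definition concrete without touching spaCy, here is a sketch of the check on a bare list of head indices. It assumes the common formulation that an arc is projective when every word strictly between the head and the dependent is a descendant of the head; use the lecture's definition if it differs. `is_projective_heads` is a hypothetical helper, the root is assumed to point to itself (as in the CONLL files above), and the input is assumed to be a well-formed tree. The second head list is a hand annotation made up for illustration.

```python
def is_projective_heads(heads):
    """heads[i] is the 0-based index of token i's head; the root points to itself."""

    def is_descendant(node, ancestor):
        # climb towards the root; assumes a well-formed tree (no cycles)
        while True:
            if node == ancestor:
                return True
            parent = heads[node]
            if parent == node:        # reached the root without meeting ancestor
                return False
            node = parent

    for dep, head in enumerate(heads):
        if head == dep:
            continue                  # the root has no incoming arc to check
        left, right = sorted((head, dep))
        for between in range(left + 1, right):
            if not is_descendant(between, head):
                return False          # a word under the arc hangs off elsewhere
    return True

# "A baby is playing": every arc covers only descendants of its head
print(is_projective_heads([1, 3, 3, 3]))                     # True

# "JetBlue canceled our flight this morning which was already late",
# with "late" attached to "flight" across "this morning"
print(is_projective_heads([1, 1, 3, 1, 5, 1, 9, 9, 9, 3]))   # False
```

With a spaCy `Doc`, Token attributes can replace the hand-written ancestor walk, which is what the hint above is pointing at.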

In [None]:
# Example projective and non-projective trees
# using compact=False to better display crossing arcs


display_doc_dep(..., d=100, compact=False)  # sm
display_doc_dep(..., d=100, compact=False)  # md

In [None]:
################################################################################
################################## EXERCISE 7 ##################################
################################################################################

def is_projective(doc):
    """
    Checks a dependency tree on projectivity. Uses the definition
        of projective arcs and checks all arcs on projectivity.
    @param doc: spaCy Doc object
    @return Bool (True if projective)
    """
    ...

In [None]:
################################################################################
################################## EXERCISE 8 ##################################
################################################################################

# Test on a projective and non-projective parse


In [None]:
# TEST on all parses from small model
for i, d in enumerate(docs_sm):
    if not is_projective(d):
        print(f"{i}: {d}")
print("Done")

In [None]:
# TEST on all parses from medium model
for i, d in enumerate(docs_md):
    if not is_projective(d):
        print(f"{i}: {d}")
print("Done")

## Ex9 [5pts]

The following sentence should have a non-projective parse, but even the medium model for spaCy version 3.8.11 does not predict a non-projective parse. (The model in version 3.7.1 did for some reason!)

_Who did you say Mary likes?_

Parse this sentence with both spaCy models. Use your function to check them for projectivity. Display them with `display_doc_dep`.

Write in a text (Markdown) cell a short discussion: If you get a non-projective parse, which model yields it, and what edge is non-projective? If not, what edge SHOULD be non-projective? What mistake is the parser making?

Your answer should be code and markdown cells that parse, display, explain, etc.

# Parsing with CoreNLP

Another library for parsing into dependency trees is CoreNLP.

CoreNLP will be used through the [Stanza CoreNLP interface](https://github.com/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb). CoreNLP provides both constituency and dependency trees.

In [None]:
# Getting dependency trees from a dependency parser
# takes <1min
# https://stanfordnlp.github.io/CoreNLP/depparse.html

with CoreNLPClient(annotators='tokenize,pos,depparse',
                   memory='4G', endpoint='http://localhost:9021', be_quiet=True,
                   output_format='json') as client:
    core_dep_parses = [ client.annotate(s)['sentences'][0] for s in sents ]

## Ex10[10pts], Ex11[1pt], Ex12[1pt]: From CoreNLP to Doc

CoreNLP dependencies, e.g., `core_dep_parses[0]['basicDependencies']`, are a list of dictionaries, each corresponding to a token. For trees, it would be handy if the dependencies were formatted as spaCy's [Doc object](https://spacy.io/api/doc), which allows us to display dependency trees (or check projectivity). Read how [Doc](https://spacy.io/api/doc) can be initialized. You should find `core_dep_parses[0]['tokens']` useful for getting values of the `spaces` and `tags` arguments.
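
The sketch below shows the index bookkeeping this conversion involves, using a hand-written stand-in for one CoreNLP sentence. The key names (`dep`, `governor`, `dependent`, `word`, `pos`, `after`) and the conventions (1-based token indices, governor `0` for the root) are assumptions about the JSON layout; inspect `core_dep_parses[0]` to confirm them. `dependency_columns` is a hypothetical helper, and building the `Doc` itself is left out.

```python
# Hand-written stand-in for one parsed sentence; real output has more keys.
sample_parse = {
    "basicDependencies": [
        {"dep": "ROOT", "governor": 0, "dependent": 2},
        {"dep": "nsubj", "governor": 2, "dependent": 1},
    ],
    "tokens": [
        {"index": 1, "word": "Dogs", "pos": "NNS", "after": " "},
        {"index": 2, "word": "sleep", "pos": "VBP", "after": ""},
    ],
}

def dependency_columns(parse):
    """Return words, tags, spaces, heads and deps as token-aligned lists."""
    tokens = parse["tokens"]
    heads = [None] * len(tokens)
    deps = [None] * len(tokens)
    # the edge list is not assumed to be in token order, so index into the lists
    for edge in parse["basicDependencies"]:
        i = edge["dependent"] - 1          # assumed 1-based -> 0-based
        # assumed: governor 0 marks the root; point the root token at itself
        heads[i] = i if edge["governor"] == 0 else edge["governor"] - 1
        deps[i] = edge["dep"]
    words = [t["word"] for t in tokens]
    tags = [t["pos"] for t in tokens]
    spaces = [t["after"] != "" for t in tokens]   # True if whitespace follows
    return words, tags, spaces, heads, deps

print(dependency_columns(sample_parse))
```

Pointing the root at itself matches the convention used in the CONLL files written earlier in this notebook.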

Ex11: Test your function by displaying the first three parses along with the parses from the other parser for the same sentences. Inspect them and see how they are similar and different.

Ex12: Convert all CoreNLP parses to spaCy `Doc` objects and store them in a variable. Write them to a conll file, `full_stanza.conll`. Check their projectivity.

In [None]:
################################################################################
################################## EXERCISE 10 #################################
################################################################################

def coreNLP2Doc(parse):
    """
    Uses info from parse['basicDependencies'] and parse['tokens']
        to initialize and return a Doc object.
    @param parse: a parse from CoreNLP
    @return: spacy.tokens.Doc with the basicDependencies, including tags.
    """
    # initialise a vocabulary for spacy Doc
    vocab = spacy.Vocab()

    ...

In [None]:
################################################################################
################################## EXERCISE 11 #################################
################################################################################

# Test your function

In [None]:
################################################################################
################################## EXERCISE 12 #################################
################################################################################

# parse all SICK trial items and write them to full_stanza.conll

core_dep_parses = ...

# Enhanced dependency graphs

In Lecture 4 you learned about Enhanced Dependencies. These are also provided in the CoreNLP parses, under the key `'enhancedDependencies'`.


## Ex 13 [10pts]: Generate Dot file

You are provided the function `write_to_dot` that takes a dict form of a graph and writes it as a dot file.

Your job is to complete the function below it, `graph2dictionary`, to create the dict that is input to `write_to_dot`. Use the docstrings of both functions and the code in `write_to_dot` to guide how you build the dict in your function.

The tests below will write a dot file, which you can inspect, as well as download and run Dot on, if you have that working, so that you can view the graph. The file should appear to the left of this Notebook (11.dot) and right-clicking should allow you to download it, and generate an image using `dot`. You can also change the syntax very slightly and put it in a LaTeX file -- see HW1.

You don't need to include the output files in your submission.
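
To make the target shape concrete, here is a hand-built toy graph in the nested form that the `write_to_dot` docstring describes, together with a hypothetical helper, `dot_lines`, that mirrors the lines `write_to_dot` writes but returns them as a list so they can be inspected without a file. The node IDs and labels are invented for illustration; how you derive them from the CoreNLP edges is the exercise.

```python
# node id -> {"label": node label, "deps": {dependent id: edge label}}
toy_graph = {
    0: {"label": "ROOT", "deps": {2: "ROOT"}},
    1: {"label": "Dogs", "deps": {}},          # leaves still need an entry
    2: {"label": "sleep", "deps": {1: "nsubj"}},
}

def dot_lines(graph):
    """Build the same lines write_to_dot emits, as a list of strings."""
    lines = ["digraph g {"]
    for node, info in graph.items():
        # one line declaring the node and its label
        lines.append(f'{node} [label="{info["label"]}"];')
        for dep, label in info["deps"].items():
            # one line per outgoing edge
            lines.append(f'{node} -> {dep} [label="{label}"];')
    lines.append("}")
    return lines

print("\n".join(dot_lines(toy_graph)))
```

Note that a node without dependents still appears as a key with an empty `deps` dict; otherwise it would never receive a label line in the Dot output.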

In [None]:
# PROVIDED

def write_to_dot(dictionary, dot_file_path):
    """
    Given a graph in dictionary form, write Dot code to a file.
    Dict should be in the form output by graph2dictionary.

    @param dictionary:
        head (int): dict:
                        "label": node label,
                        "deps": dict:
                            dependent (int): edge label
    @param dot_file_path (str): path to write code to (including filename.dot)
    """
    with open(dot_file_path, 'w') as dot:
        dot.write("digraph g {\n")
        for node in dictionary:
            # write the node and its label
            dot.write(f"{node} [label=\"{dictionary[node]['label']}\"];\n")
            for (dep, label) in dictionary[node]["deps"].items():
                # write the edge and its label
                dot.write(f"{node} -> {dep} [label=\"{label}\"];\n")
        dot.write("}")

In [None]:
################################################################################
################################## EXERCISE 13 #################################
################################################################################


def graph2dictionary(edges):
    """
    Given a dependency list from one parsed sentence from CoreNLP,
    e.g. the 'enhancedDependencies' entry,
    extracts the nodes, edges, and labels into a dict.
    Returns the dictionary.

    @param edges: a list of edges from an entry from CoreNLP (type dict)
    @return dict
    """

In [None]:
# TEST run the function on entry 11's enhanced dependencies
graph_11 = graph2dictionary(core_dep_parses[11]['enhancedDependencies'])

In [None]:
# TEST write to dot file
write_to_dot(graph_11, "11.dot")


## Ex 14 [3pts]

The CONLL file format we have been using will not work for Enhanced Dependencies. Why not?