<a href="https://colab.research.google.com/github/Teaganstmp/Langlearning/blob/main/Copy_of_Problem_Set_2_Universal_Dependencies_Student_Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Dependency Lengths with the Universal Dependencies Corpus

In this advanced option for the problem set, you are going to explore the Universal Dependencies corpus in order to study **dependency lengths**, a feature of language that has been shown to have many psycholinguistically important properties.

A dependency (as you know from our work so far with the UD corpus) is a structural relationship between semantically connected words that does not necessarily depend on the linear distance between those words. It has been argued that dependency lengths are minimized in language and that this can explain certain word-order preferences. For instance, you can say:

(1) "I threw the trash out"

or

(2) "I threw out the trash"

In these sentences (which speakers seem to find equally good), there is a dependency relationship between "threw" and "out".

But many English speakers have a preference for (3) over (4):

(3) "I threw out the trash that was sitting on the porch and beginning to smell."

(4) "I threw the trash that was sitting on the porch and beginning to smell out."

The claim is that (4) is dispreferred because of the very long dependency between "threw" and "out".

See Futrell, Mahowald, and Gibson (2015), https://mahowak.github.io/assets/pdf/dep.pdf, which used the Universal Dependencies corpus. That paper operationalizes dependency length in a simple way: the number of words from a head to its dependent. So for "I threw out the trash", the dependency distance between "threw" and "out" is 1. In "I threw the trash out", the distance between "threw" and "out" is 3.
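
As a minimal sketch of this operationalization, the distance can be computed as the absolute difference between the positions of the two words. The helper name `dependency_distance` is made up for illustration (it is not from the paper), and it assumes each word occurs only once in the sentence.

```python
def dependency_distance(words, head, dependent):
    """Number of words between head and dependent, counting the dependent itself.

    Assumes both words occur exactly once in `words` (an illustrative shortcut).
    """
    return abs(words.index(head) - words.index(dependent))


short_gap = "I threw out the trash".split()
long_gap = "I threw the trash out".split()

print(dependency_distance(short_gap, "threw", "out"))  # 1
print(dependency_distance(long_gap, "threw", "out"))   # 3
```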

We want you to explore some questions related to this topic below, using the Universal Dependencies corpus to assess whether dependency relationships are minimized in human languages.

Show all your code here, and include nice comments and explanatory text. When there are questions we ask you to answer in prose, use the text box instead of the code box.

# Some code to get going with the UD Corpus

In [None]:
#Code and text from https://colab.research.google.com/drive/1d7LO_0665DYw6DrVJXXautJAJzHHqYOm#scrollTo=4WwZYkNr1bPN
#This cell loads the Universal Dependencies treebank corpus. It downloads every treebank; the starter examples below
#use the English GUM treebank. We'll also install the conllu package, which was developed to parse data in the CoNLL-U
#format, a format common for linguistically annotated files. The list of test files is stored in ud_files.

#Install conllu package, download the UD Treebanks corpus and unpack it.
!pip install conllu
!wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
!tar zxf ud-treebanks-v2.5.tgz

#The imports needed to open and parse the conllu file. At the end we'll have a list of dicts.
from io import open
import conllu
import glob
from collections import defaultdict
import numpy as np
import pandas as pd
import random

# this code gets all the languages
ud_files = glob.glob("ud-treebanks-v2.5/*/*-test.conllu")


# Problem 1: our first analysis of dependency lengths

For each language in `ud_files` above, your task is to find the average length of all the dependencies in that language, across the entire corpus.

So for "The dog ran", you'll find a dependency of length 1 between "The" and "dog" and a dependency of length 1 between "dog" and "ran", so the average is 1. For our purposes, you can compute a dependency length by subtracting the head ID from the dependent ID and taking the absolute value with `abs()`: we care about the distance, not the direction, so we don't want negative dependency lengths.
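
A small sketch of that calculation for "The dog ran" is shown here. The `(id, form, head)` triples are written by hand rather than read from the corpus, and skipping the root arc (head 0) is an assumption that matches the two dependencies counted above.

```python
# (id, form, head) triples for "The dog ran"; head 0 marks the root.
tokens = [(1, "The", 2), (2, "dog", 3), (3, "ran", 0)]

# Skip the root arc, since it is not a word-to-word dependency.
lengths = [abs(tok_id - head) for tok_id, _, head in tokens if head != 0]

print(lengths)                      # [1, 1]
print(sum(lengths) / len(lengths))  # 1.0
```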

There are likely some choices you will have to make in how to do this: there is no one right way to do it, but we encourage you to document and validate your choices. (Think especially carefully about choices that could affect the linguistic conclusions you draw. N.B.: how do you think dependency lengths relate to sentence lengths?)

Below, we guide you through how to get started.

# Problem 1a

To get you started, here is some starter code looking at one English sentence from the GUM corpus: "He was interviewed by Wikinews."

To play around with that sentence, here is some code to put it into a nicer format, `nicedata`. You can use this format for your own analyses.

Play around with this code to get a sense of what it does.

In [None]:
# some starter code to explore one sentence and put it in a nice format
# you will want to get this into a function
file_path = "ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-train.conllu"

with open(file_path, 'r', encoding='utf-8') as f:
  data = f.read()

# Parse the file using the conllu library
sentences = conllu.parse(data)
s = sentences[1991]
print(s) # Our sentence is: "He was interviewed by Wikinews."
nicedata = [{"token": None,
             "id": i,
             "deps": [],
             "pos": None}
            for i in range(len(s) + 1)]
for token in s:
  if token["head"] is not None and token["id"] != 0: # we need to be careful here!
    nicedata[token["id"]]["token"] = token["form"]
    nicedata[token["id"]]["pos"] = token["upos"]
    nicedata[token["head"]]["deps"] += [(token["id"], token["deprel"])]

for token in nicedata:
  print(token)

TokenList<He, was, interviewed, by, Wikinews, ., metadata={sent_id: "GUM_interview_shalev-6", text: "He was interviewed by Wikinews.", s_type: "decl"}>
{'token': None, 'id': 0, 'deps': [(3, 'root')], 'pos': None}
{'token': 'He', 'id': 1, 'deps': [], 'pos': 'PRON'}
{'token': 'was', 'id': 2, 'deps': [], 'pos': 'AUX'}
{'token': 'interviewed', 'id': 3, 'deps': [(1, 'nsubj:pass'), (2, 'aux:pass'), (5, 'obl'), (6, 'punct')], 'pos': 'VERB'}
{'token': 'by', 'id': 4, 'deps': [], 'pos': 'ADP'}
{'token': 'Wikinews', 'id': 5, 'deps': [(4, 'case')], 'pos': 'PROPN'}
{'token': '.', 'id': 6, 'deps': [], 'pos': 'PUNCT'}


Above we see the dependencies in our `nicedata` structure. It seems like we probably shouldn't count the 0th element (which points to "root") or the punctuation. Why is that?

So what we do want to count are these dependencies:
- interviewed (3) to He (1)
- interviewed (3) to was (2)
- interviewed (3) to Wikinews (5)
- Wikinews (5) to by (4)

Computing this manually, we should get dependency lengths of [2, 1, 2, 1] for a mean of 1.5. Write code to do this by processing `nicedata`.

Confirm that for this sentence you get a length of 1.5.
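
One possible way to approach this is sketched below; it is not the only valid solution. The `nicedata` list is copied by hand from the output printed above so that the block runs on its own. The function name `sentence_dep_lengths` is made up for illustration, and excluding arcs by the `punct` relation (rather than by part of speech) is one assumption among several reasonable ones.

```python
from statistics import mean

# The `nicedata` structure printed above, copied by hand.
nicedata = [
    {"token": None, "id": 0, "deps": [(3, "root")], "pos": None},
    {"token": "He", "id": 1, "deps": [], "pos": "PRON"},
    {"token": "was", "id": 2, "deps": [], "pos": "AUX"},
    {"token": "interviewed", "id": 3,
     "deps": [(1, "nsubj:pass"), (2, "aux:pass"), (5, "obl"), (6, "punct")],
     "pos": "VERB"},
    {"token": "by", "id": 4, "deps": [], "pos": "ADP"},
    {"token": "Wikinews", "id": 5, "deps": [(4, "case")], "pos": "PROPN"},
    {"token": ".", "id": 6, "deps": [], "pos": "PUNCT"},
]


def sentence_dep_lengths(nicedata):
    """Return the dependency lengths for one sentence, skipping root and punctuation arcs."""
    lengths = []
    for head in nicedata:
        if head["id"] == 0:  # the 0th element only points to the root
            continue
        for dep_id, deprel in head["deps"]:
            if deprel == "punct":  # assumption: drop punctuation arcs
                continue
            lengths.append(abs(head["id"] - dep_id))
    return lengths


deplengths = sentence_dep_lengths(nicedata)
print(deplengths)        # [2, 1, 2, 1]
print(mean(deplengths))  # 1.5
```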


In [None]:
# TODO

# a nice thing to do when coding is to use unit tests
# to make sure things are working as expected
# if deplengths is your list of dependency lengths for the sentence
# the below code will check that the mean is equal to 1.5
mean_deplengths = np.mean(deplengths)
assert(mean_deplengths == 1.5)

1.5


# Problem 1b

Using the steps above as you see fit, now actually compute the dependency lengths across all dependencies in each corpus.

Dependency lengths are confounded with sentence length: a sentence of length 5 can't have a dependency length of 100! To get around this, you might choose to filter your data to look at only sentences of the same length (length 10 might be a good length to pick).
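
A minimal sketch of such a filter follows. It assumes each sentence is a `nicedata`-style list whose 0th element is the root placeholder, so the sentence length is `len(sent) - 1`; note that this count includes punctuation tokens, which you may or may not want. The name `filter_by_length` and the toy sentences are made up for illustration.

```python
def filter_by_length(nicesents, target_len=10):
    """Keep sentences with exactly target_len tokens (index 0 is the root placeholder)."""
    return [sent for sent in nicesents if len(sent) - 1 == target_len]


# Toy stand-ins for parsed sentences: only the list length matters here.
toy_sents = [[{"id": i} for i in range(n + 1)] for n in (5, 10, 10, 12)]

print(len(filter_by_length(toy_sents)))  # 2
```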

You should print out the name of the language + its corpus information (so you will have multiple Englishes, for instance) along with its average dependency length. To get a nice average over a list, rounded to two decimals, you can use `round(np.mean(LIST), 2)` where LIST is the name of your list of dependency lengths.
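
One way to get a label with both the language and the corpus is to read it off the treebank folder name in the file path. This is a sketch under the assumption that every path follows the `UD_<Language>-<Corpus>/` folder pattern seen in the starter code; `treebank_label` is a hypothetical helper name.

```python
import os


def treebank_label(path):
    """Return the treebank folder name without the 'UD_' prefix."""
    folder = os.path.basename(os.path.dirname(path))
    return folder[len("UD_"):] if folder.startswith("UD_") else folder


# Example path in the same shape as the entries of ud_files.
path = "ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-test.conllu"
print(treebank_label(path))  # English-GUM
```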

In [None]:
def get_nice_data(fn):
  """TODO: write a function which takes in the filename from the list in ud_files,
  return a dictionary with the nicedata format and a language"""
  return {"lang": lang, "nicedata": nicesents}


# TODO: run get_nice_data() to get all the languages in a nicer format

In [None]:
def get_dep_lengths(nicedata):
  """TODO: compute the dependency lengths on nice data,
  return the list of dependency lengths"""
  return deplengths

# TODO: compute the average dependency length for each language

# Problem 2

Now we want to start tackling the problem of whether dependency relations are minimized. But how do we do this? So, ok, let's say the average dependency length in English is 6.7 or whatever. Is that a lot? A little? We need to compare to something!

What should we compare to? Let's try something simple: a random baseline. In addition to computing the dependency lengths as they really are, we compute them again after randomly reordering the words in each sentence, and then compare the two.

To do this, take all the sentences as you have them and give each token a new field called `random_id`, which assigns it a randomly chosen position in the sentence (a random permutation of the original ids).

So if you have:
- The id: 1
- dog id: 2
- chased id: 3
- the id: 4
- cat id: 5

It might become:
- The id: 1 random_id: 4
- dog id: 2 random_id: 3
- chased id: 3 random_id: 1
- the id: 4 random_id: 5
- cat id: 5 random_id: 2

Now do your analyses from Problem 1 (note that you should create functions in Problem 1 that you can re-use!) for both `id` and `random_id`. Get the average dependency length for each language, using `id` and `random_id`.
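
As a rough sketch of the `random_id` idea, the ids of a sentence can be shuffled and paired back with the tokens. The function name `add_random_ids` and the simple token dictionaries are assumptions made for this illustration; the output changes with the random seed.

```python
import random


def add_random_ids(tokens, rng=random):
    """Return copies of the tokens with a `random_id` drawn from a shuffle of their ids."""
    shuffled = [tok["id"] for tok in tokens]
    rng.shuffle(shuffled)
    return [dict(tok, random_id=new_id) for tok, new_id in zip(tokens, shuffled)]


tokens = [{"token": w, "id": i}
          for i, w in enumerate("The dog chased the cat".split(), start=1)]

# A seeded generator makes the shuffle repeatable while you debug.
for tok in add_random_ids(tokens, random.Random(0)):
    print(tok)
```

Because `random_id` is a permutation of the original ids, every position is used exactly once, so the shuffled sentence has the same length as the real one.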

We walk through how to do that below.

# Problem 2a

To get started, let's play around with a nice way to do the random baseline by permuting the ids. Your task is to compute the dependency lengths using our new order, assuming that the dependency structure is the SAME as with the old order (so there is still a dependency between "Wikinews" and "by", for instance).

Below we take our same sentence, "He was interviewed by Wikinews.", and change it to "was He by . interviewed Wikinews".

We still keep the dependents and heads the same, but they have a new order. So we get new dep lengths:
- interviewed (5) to He (2)
- interviewed (5) to was (1)
- interviewed (5) to Wikinews (6)
- Wikinews (6) to by (3)

Computing these lengths, we get [3, 4, 1, 3] for a mean of 2.75.

Below, write some code to compute the dependency lengths and make sure that you get 2.75.
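
The arithmetic can be checked with a short sketch that looks up each head and dependent in the old-to-random mapping. The `arcs` list is written by hand from the four dependencies listed above (root and punctuation excluded), so it is an illustration rather than code that processes `nicedata`.

```python
# (head_id, dependent_id) pairs to count, in the original word order.
arcs = [(3, 1), (3, 2), (3, 5), (5, 4)]

# Same mapping as in the starter code: old position -> random position.
mapping_from_old_to_rand = {0: 0, 1: 2, 2: 1, 3: 5, 4: 3, 5: 6, 6: 4}

rand_lengths = [abs(mapping_from_old_to_rand[head] - mapping_from_old_to_rand[dep])
                for head, dep in arcs]

print(rand_lengths)                           # [3, 4, 1, 3]
print(sum(rand_lengths) / len(rand_lengths))  # 2.75
```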



In [None]:
file_path = "ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-train.conllu"

with open(file_path, 'r', encoding='utf-8') as f:
  data = f.read()

# Parse the file using the conllu library
sentences = conllu.parse(data)
s = sentences[1991]
print(s) # Our sentence is: "He was interviewed by Wikinews."
nicedata = [{"token": None, "id": i, "deps": [], "pos": None} for i in range(len(s) + 1)]
for token in s:
  if token["head"] is not None and token["id"] != 0: # we need to be careful here!
    nicedata[token["id"]]["token"] = token["form"]
    nicedata[token["id"]]["pos"] = token["upos"]
    nicedata[token["head"]]["deps"] += [(token["id"], token["deprel"])]

# let's assume now that we have some random order
# "he was interviewed by wikinews ." > "was he by . interviewed Wikinews"
# so we can learn a mapping from the old index to the random index (keeping 0
# the same)
mapping_from_old_to_rand = {0: 0,
                   1: 2,
                   2: 1,
                   3: 5,
                   4: 3,
                   5: 6,
                   6: 4}
rev_map = {mapping_from_old_to_rand[i]: i for i in mapping_from_old_to_rand}

# print the new order
for i in range(len(nicedata)):
  print(nicedata[rev_map[i]]["token"])

# TODO: get the dependency lengths for the new sentence


TokenList<He, was, interviewed, by, Wikinews, ., metadata={sent_id: "GUM_interview_shalev-6", text: "He was interviewed by Wikinews.", s_type: "decl"}>
None
was
He
by
.
interviewed
Wikinews
2 5
1 5
6 5
3 6
[3, 4, 1, 3]
2.75


Below, you can see how to do this for a different random order each time. Run the code below a few times to see the different random orders that emerge. Note that we keep the 0th position the same so that we don't mess with the root.

In [None]:
# we can make a new random order, keeping the 0th position constant
randlist = [0] + random.sample(range(1, len(nicedata)), len(nicedata) - 1)
mapping_from_old_to_rand = {i: randlist[i] for i in range(len(randlist))}
rev_map = {mapping_from_old_to_rand[i]: i for i in mapping_from_old_to_rand}

# print the new order
print(" ".join([nicedata[rev_map[i]]["token"] for i in range(1, len(nicedata))]))

interviewed . was He Wikinews by


# Problem 2b

Now actually compute the average dependency length for each language, comparing it to its random baseline! Feel free to use the code above, and set a new random order per sentence.

For how many languages is the mean dependency length for `id` longer than for `random_id`? Print the result and also include some text discussing this.

Are random dependency lengths longer or shorter than real language dependency lengths?

Bonus: Why isn't this an ideal random baseline? You can refer to the 2015 paper linked above to get some ideas for other baselines!

Bonus 2: Outline a procedure for helping us be confident that these results mean what we think. You might consider things like statistical significance, robustness to the choices you made in your analysis, and more.
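
One ingredient of such a procedure might be to repeat the random baseline many times and see how often a random order is at least as short as the real one. The sketch below does this for the single example sentence only, so it is purely illustrative: a real analysis would aggregate over a whole corpus. The names `random_mean_length` and `share_as_short`, the seed, and the 1000 repetitions are arbitrary assumptions.

```python
import random
from statistics import mean

# Dependencies of "He was interviewed by Wikinews." (root and punctuation excluded).
arcs = [(3, 1), (3, 2), (3, 5), (5, 4)]
n_tokens = 6  # including the final period


def random_mean_length(arcs, n_tokens, rng):
    """Mean dependency length after randomly permuting the token positions."""
    new_positions = list(range(1, n_tokens + 1))
    rng.shuffle(new_positions)
    mapping = dict(zip(range(1, n_tokens + 1), new_positions))
    return mean(abs(mapping[head] - mapping[dep]) for head, dep in arcs)


rng = random.Random(0)  # fixed seed so the run is repeatable
real_mean = mean(abs(head - dep) for head, dep in arcs)
baseline = [random_mean_length(arcs, n_tokens, rng) for _ in range(1000)]

# Share of random orders whose mean length is no longer than the real one.
share_as_short = sum(b <= real_mean for b in baseline) / len(baseline)
print(real_mean, round(mean(baseline), 2), share_as_short)
```

A small share suggests that the real order is unusually short compared with random orders; repeating the run with different seeds or sentence lengths is one way to probe robustness.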

In [None]:
# TODO: compare the real dependency lengths to the random baseline!