# Chapter 1: explore the parallel UD treebank (PUD)

## Question 4 

First, we read the relevant data:

In [0]:
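from pandas import read_csv # CoNLL-U is tab-separated, so read_csv is a quick way to load it

# assuming the conllu files are in the current folder;
# columns 1, 3 and 7 are FORM, UPOS and DEPREL, and lines starting with "#" are sentence-level comments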
en_table = read_csv('en_pud-ud-test.conllu', sep="\t", comment="#", usecols=[1,3,7], names=["token", "POS tag", "Dep. label"])
es_table = read_csv('es_pud-ud-test.conllu', sep="\t", comment="#", usecols=[1,3,7], names=["token", "POS tag", "Dep. label"])
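
Because `read_csv` applies CSV quoting rules and treats `#` as a comment marker anywhere on a line, tokens that are themselves a double quote or a `#` may be misread. A minimal sketch of a stricter reader follows, assuming the standard ten tab-separated CoNLL-U columns. `read_conllu_rows` is a hypothetical helper, not part of pandas or of this notebook, and the counts it yields may differ slightly from those reported in this chapter, which come from the `read_csv` calls above.

```python
def read_conllu_rows(lines):
    """Collect (FORM, UPOS, DEPREL) tuples from an iterable of CoNLL-U lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and sentence-level comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed token line
        rows.append((fields[1], fields[3], fields[7]))
    return rows

# Tiny made-up sample (not from PUD), only to exercise the function.
sample = [
    '# text = "Hi" #',
    '1\t"\t"\tPUNCT\t_\t_\t2\tpunct\t_\t_',
    '2\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_',
    '3\t"\t"\tPUNCT\t_\t_\t2\tpunct\t_\t_',
    '4\t#\t#\tSYM\t_\t_\t2\tdep\t_\t_',
    '',
]
sample_rows = read_conllu_rows(sample)
print(len(sample_rows))  # 4
```

With a real file, the same function can be called on an open file object, and the resulting list can be passed to `pandas.DataFrame(rows, columns=["token", "POS tag", "Dep. label"])` to obtain a table with the same column names as above.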

We then get the 10 most frequent POS tags and dependency labels, together with their number of occurrences.

In [0]:
from nltk import FreqDist

def top10(col):
  return FreqDist(col).most_common(10)

top10_en_tags = top10(en_table["POS tag"])
top10_es_tags = top10(es_table["POS tag"])
top10_en_deps = top10(en_table["Dep. label"])
top10_es_deps = top10(es_table["Dep. label"])
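
`FreqDist` is a subclass of `collections.Counter`, so `most_common(10)` returns a list of `(item, count)` pairs sorted by decreasing count. The following sketch uses `Counter` on a made-up tag list (not taken from the treebanks) to show the shape of the result; `top_n` is a hypothetical generalisation of `top10`.

```python
from collections import Counter

def top_n(values, n=10):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(values).most_common(n)

# Made-up data, only to illustrate the return value.
toy_tags = ["NOUN", "ADP", "NOUN", "DET", "NOUN", "ADP", "VERB"]
toy_top2 = top_n(toy_tags, n=2)
print(toy_top2)  # [('NOUN', 3), ('ADP', 2)]
```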

In [0]:
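def print_sidebyside(top10en, top10es):
  """Print two ranked lists of (item, count) pairs in parallel columns."""
  fmt = '{:<5}{:<20}{}'
  print(fmt.format('', 'English', 'Spanish'))
  # zip stops at the shorter list, so both inputs should have the same length
  for i, (p1, p2) in enumerate(zip(top10en, top10es)):
    print(fmt.format(i + 1, str(p1), str(p2)))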

As shown below, `NOUN` and `ADP` are the first and second most frequent POS tags in both treebanks, but Spanish has far more adpositions than English (4167 vs 2493). This is not surprising: for instance, many verbs are directly followed by their object in English but require a preposition in Spanish (e.g. _dream_ vs _soñar con_).

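Unlike in Spanish, the third most frequent tag in the English treebank is `PUNCT`, which may indicate that English sentences and clauses tend to be shorter than their Spanish counterparts. Note, however, that the absolute `PUNCT` counts are close (2301 vs 2216), so the difference in rank also reflects the much higher number of determiners in Spanish.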

Another way in which the two languages differ significantly is the number of determiners (3260 in Spanish vs 2086 in English). This may be because English omits articles more often, for instance in generic statements such as "Children should go to the park every day" (which says something about children in general); this can be translated into Spanish as "_Los_ niños tendrían que ir al parque todos los días".

In [57]:
print_sidebyside(top10_en_tags, top10_es_tags)

     English             Spanish
1    ('NOUN', 4040)      ('NOUN', 4721)
2    ('ADP', 2493)       ('ADP', 4167)
3    ('PUNCT', 2301)     ('DET', 3260)
4    ('VERB', 2156)      ('PUNCT', 2216)
5    ('DET', 2086)       ('VERB', 2115)
6    ('PROPN', 1727)     ('ADJ', 1434)
7    ('ADJ', 1540)       ('PROPN', 1220)
8    ('PRON', 1021)      ('PRON', 1035)
9    ('AUX', 1014)       ('ADV', 895)
10   ('ADV', 849)        ('AUX', 735)
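
Since raw counts depend on how many tokens each treebank contains, one way to compare the two columns is to relate each tag to the number of nouns. The sketch below does this with the counts printed above. `per_noun` is a hypothetical helper, and normalising by `NOUN` is an assumption made here for illustration; dividing by the total number of tokens would be an alternative.

```python
# Counts copied from the POS tag output above.
en_counts = {"NOUN": 4040, "ADP": 2493, "DET": 2086}
es_counts = {"NOUN": 4721, "ADP": 4167, "DET": 3260}

def per_noun(counts, tag):
    """Return how many occurrences of `tag` there are per NOUN."""
    return counts[tag] / counts["NOUN"]

for tag in ("ADP", "DET"):
    print(f"{tag}: en={per_noun(en_counts, tag):.2f}  es={per_noun(es_counts, tag):.2f}")
```

Under this normalisation Spanish still has more adpositions and determiners per noun than English (roughly 0.88 vs 0.62 for `ADP` and 0.69 vs 0.52 for `DET`).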


In [58]:
print_sidebyside(top10_en_deps, top10_es_deps)

     English             Spanish
1    ('case', 2499)      ('case', 3648)
2    ('punct', 2301)     ('det', 3425)
3    ('det', 2047)       ('punct', 2216)
4    ('nsubj', 1393)     ('nmod', 1766)
5    ('amod', 1336)      ('obl', 1521)
6    ('obl', 1237)       ('amod', 1276)
7    ('nmod', 1076)      ('nsubj', 1168)
8    ('root', 1000)      ('root', 996)
9    ('obj', 876)        ('advmod', 829)
10   ('advmod', 852)     ('obj', 773)
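
The dependency labels show the same pattern as the POS tags: `case` and `det` are much more frequent in Spanish. As a rough sketch, the counts printed above can be turned into per-label differences. `label_differences` is a hypothetical helper, and the dictionaries simply copy the numbers from the output.

```python
# Counts copied from the dependency label output above.
en_deps = {"case": 2499, "punct": 2301, "det": 2047, "nsubj": 1393, "amod": 1336,
           "obl": 1237, "nmod": 1076, "root": 1000, "obj": 876, "advmod": 852}
es_deps = {"case": 3648, "det": 3425, "punct": 2216, "nmod": 1766, "obl": 1521,
           "amod": 1276, "nsubj": 1168, "root": 996, "advmod": 829, "obj": 773}

def label_differences(first, second):
    """Return (label, second - first) pairs for shared labels, largest gap first."""
    shared = first.keys() & second.keys()
    diffs = [(label, second[label] - first[label]) for label in shared]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)

dep_diffs = label_differences(en_deps, es_deps)
for label, diff in dep_diffs[:3]:
    print(f"{label:<6}{diff:+d}")
```

The three largest gaps are `det` (+1378), `case` (+1149) and `nmod` (+690), all in favour of Spanish.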
