# Assignment 4b MA : Build your own corpus exploration tool

**Deadline for Assignment 4a+b: Friday, October 9, 2020 (17.00) via Canvas (Assignment 4)** 


In this assignment, you're going to build your own tool for exploring the **Parallel Meaning Bank** (PMB). This resource is a **parallel corpus**, which means that it contains the **same documents translated into multiple languages**. Such resources are very valuable for many areas of linguistics and Natural Language Processing (NLP), most notably Machine Translation (MT). 

For this part of assignment 4, you will submit two python scripts called:

* `explore_pmb.py`
* `utils.py`

The corpus contains a lot of data, but not every document is translated into every language. Therefore, we will build a tool which explores different aspects of coverage. Your tool will be able to:

* explore the **overall coverage per language**
* explore the **parallel coverage of a given language pair** (i.e. how many documents and tokens exist in a language pair?)
* **browse parallel text** in given language pairs

Before diving into building the tool, we're going to guide you through a couple of warm-up examples. You can use them to explore the data structure and write your code. It is permitted to copy-paste bits of code (you will have to modify them to solve all exercises). 

The assignment is structured as follows:

1. Understanding the data structure (code snippets to guide you through the corpus)
2. Writing functions (writing the actual code)
3. Putting the tool together (combining the code)
4. Testing and submission (a final check of whether your code does what it is supposed to do)


You can learn more about the PMB [here](https://pmb.let.rug.nl/). 

If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com. Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1551Db87zckRPbKDosZ4105htEUxVWZu9ejDj3MM8qck/edit?usp=sharing), so please check it before you email. 

**Tip**: Read the entire assignment before you start writing code. Try to understand the tool we're building before you start. Making notes with pen and paper can be very helpful.

## 1. Understanding the data structure

In this part, we guide you through the data structure. You can use the code below for the rest of your assignment. You can play with the code and add things to it, but you will not receive points in this part. Its purpose is to make you familiar with the data structure.  

### 1.a Download the data

1.) Please go to this website: [here](https://pmb.let.rug.nl/data.php)

2.) Select version 2.1.0 (the latest version is too big for our purposes) and store the zip file as `PMB/pmb-2.1.0.zip` on your computer (remember where).

3.) Unpack the data. You can do this from the terminal by navigating to the directory using `cd`. You should be able to run `unzip pmb-2.1.0.zip` to unzip the file. Alternatively, you can simply unzip by clicking on it. Attention: Unpacking the file may take a while. 

Use the cell below to assign the path to the data to a variable. We will only consider the gold data for this assignment, therefore you can add the gold directory to the path.

Path: `'PMB/pmb-2.1.0/data/gold/'`

**Please run the following cell to check if your data are in the right place. If they are, it will not print anything.**

In [2]:
import os

my_path = "/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/"
# e.g.:
# my_path = '/Users/piasommerauer/Data/'

path_pmb = f"{my_path}PMB/pmb-2.1.0/data/gold"
assert os.path.isdir(path_pmb), "corpus data not found"

### 1.b Parallel documents 
Before we can build anything, we have to understand how the data are structured. We start by looking at a single document. 

Parallel documents are stored in the same document directory (d+number). The filenames indicate the language (e.g. en = English). The data we're interested in are stored in .xml format. Run the cell below to inspect the filepaths of a single document. Feel free to modify the path to inspect other documents. 

In [3]:
import glob

test_part = "p27"
test_document = "d0852"

test_doc_path = f'{path_pmb}/{test_part}/{test_document}/'
test_doc_files = glob.glob(f'{test_doc_path}*.xml')

for f in test_doc_files:
    print(f)

/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p27/d0852\de.drs.xml
/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p27/d0852\en.drs.xml
/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p27/d0852\nl.drs.xml
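
The language code is the first dot-separated part of each filename, so the languages available for a document can be read off the file list. The sketch below illustrates this on a throwaway directory with empty stand-in files; the directory layout and file names are assumptions modeled on the output above, not real corpus data.

```python
import glob
import os
import tempfile

# Build a throwaway document directory with empty stand-in files
# (hypothetical names, modeled on the PMB layout shown above).
with tempfile.TemporaryDirectory() as tmp_dir:
    doc_dir = os.path.join(tmp_dir, "p27", "d0852")
    os.makedirs(doc_dir)
    for file_name in ("de.drs.xml", "en.drs.xml", "nl.drs.xml"):
        open(os.path.join(doc_dir, file_name), "w").close()

    # The language code is everything before the first dot of the filename.
    doc_languages = sorted(
        os.path.basename(path).split(".")[0]
        for path in glob.glob(os.path.join(doc_dir, "*.xml"))
    )

print(doc_languages)
```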


### 1.c XML structure of a single document

Below, we access a single document and load the xml structure using lxml.etree. Run the cell to print the xml tree. 

Explore the document structure and try to answer these questions:

* Where can you find the full text of the document?
* Where can you find information about each token in the text?

In [4]:
from lxml import etree

test_doc_path_en = test_doc_path+'en.drs.xml'
doc_tree = etree.parse(test_doc_path_en)
doc_root = doc_tree.getroot()
etree.dump(doc_root, pretty_print=True)

<xdrs-output version="'boxer v1.00 (unix build on 24 May 2018, 11:14:54)'">
<!-- I 'm not tired at~all . --> 

<xdrs xml:id="xdrs1">
 <taggedtokens>
  <tagtoken xml:id="i1001">
   <tags>
     <tag type="verbnet" n="0">[]</tag>
     <tag type="tok">I</tag>
     <tag type="sym">speaker</tag>
     <tag type="lemma">speaker</tag>
     <tag type="from">0</tag>
     <tag type="to">1</tag>
     <tag type="pos">PRP</tag>
     <tag type="sem">PRO</tag>
     <tag type="wordnet">O</tag>
   </tags>
  </tagtoken>
  <tagtoken xml:id="i1002">
   <tags>
     <tag type="verbnet" n="0">[]</tag>
     <tag type="tok">'m</tag>
     <tag type="sym">be</tag>
     <tag type="lemma">be</tag>
     <tag type="from">1</tag>
     <tag type="to">3</tag>
     <tag type="pos">VBP</tag>
     <tag type="sem">NOW</tag>
     <tag type="wordnet">O</tag>
   </tags>
  </tagtoken>
  <tagtoken xml:id="i1003">
   <tags>
     <tag type="verbnet" n="0">[]</tag>
     <tag type="tok">not</tag>
     <tag type="sym">not</tag>
     <

## 2. Writing functions

In this part of the assignment, we guide you through writing the functions for your tool. Feel free to use the notebook for exploration, but your final functions should be stored in `utils.py`. 

### 2.a Get all token elements of a document in a given language

Write a function which fulfills the following requirements: 

* Positional parameter: path to the document in a particular language 
* Output: list of token elements (the token elements are called 'tagtoken')

In [5]:
def get_tokens(path_to_doc):
    """Return the list of all tagtoken elements in the document at path_to_doc."""
    tree = etree.parse(path_to_doc)
    root = tree.getroot()
    # tagtoken elements sit under <xdrs>/<taggedtokens> in the document tree
    tokens = root.findall("xdrs/taggedtokens/tagtoken")
    return tokens

# Test your function
test_part = "p27"
test_document = "d0852"
language = "en"
test_doc_path = f"{path_pmb}/{test_part}/{test_document}/{language}.drs.xml"
# Function call
tokens = get_tokens(test_doc_path)
print(tokens)

assert len(tokens) == 6 and type(tokens[1]) == etree._Element, "token list not correct"

[<Element tagtoken at 0x23121c11b40>, <Element tagtoken at 0x23121c11b80>, <Element tagtoken at 0x23121c11bc0>, <Element tagtoken at 0x23121c11c00>, <Element tagtoken at 0x23121c11c40>, <Element tagtoken at 0x23121c11c80>]
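
The path expression can also be tried on a small inline document, without the corpus on disk. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml` (an assumption made so that it runs without extra installs; `findall` accepts the same path here). The two-token XML string is a shortened, made-up variant of the file shown above.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a PMB document (illustrative, not real corpus data).
toy_xml = """
<xdrs-output>
 <xdrs>
  <taggedtokens>
   <tagtoken><tags><tag type="tok">I</tag><tag type="pos">PRP</tag></tags></tagtoken>
   <tagtoken><tags><tag type="tok">'m</tag><tag type="pos">VBP</tag></tags></tagtoken>
  </taggedtokens>
 </xdrs>
</xdrs-output>
"""

toy_root = ET.fromstring(toy_xml)
# The path is relative to the root element, so it starts at its child <xdrs>.
toy_tokens = toy_root.findall("xdrs/taggedtokens/tagtoken")
print(len(toy_tokens), toy_tokens[0].tag)
```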


### 2.b Get token and pos from a token element

Write a function which fulfills the following requirements: 

* Positional parameter: token element
* Output: token (string) and part of speech tag (string) of the token element

An example token element is shown below. (You can use it for testing.)

In [11]:
test_token_str = """
 <tagtoken xml:id="i1002">
   <tags>
     <tag type="verbnet" n="0">[]</tag>
     <tag type="tok">'m</tag>
     <tag type="sym">be</tag>
     <tag type="lemma">be</tag>
     <tag type="from">1</tag>
     <tag type="to">3</tag>
     <tag type="pos">VBP</tag>
     <tag type="sem">NOW</tag>
     <tag type="wordnet">O</tag>
   </tags>
 </tagtoken>
"""

test_token = etree.fromstring(test_token_str)
print(test_token)

<Element tagtoken at 0x1df881c9880>


In [27]:
def get_token_pos(token_element):
    """Return the token string and the part-of-speech tag of a tagtoken element."""
    # default to None so the function still returns if a tag is missing
    tag_token = None
    tag_pos = None
    for tag in token_element.findall("tags/tag"):
        if tag.get("type") == "tok":
            tag_token = tag.text
        elif tag.get("type") == "pos":
            tag_pos = tag.text
    return tag_token, tag_pos

# Test your function using the example token element
token, pos = get_token_pos(test_token)
print(token, pos)

assert token == "'m" and pos == 'VBP', 'token and pos not correct'

'm VBP
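
As an alternative to looping over all tags, the path syntax supports attribute predicates, so each value can be looked up directly. This is a minimal sketch, again using the standard library's `xml.etree.ElementTree` as a stand-in for `lxml`; the function name `get_token_pos_by_path` and the toy element are hypothetical.

```python
import xml.etree.ElementTree as ET

def get_token_pos_by_path(token_element):
    """Look up the 'tok' and 'pos' tags with attribute predicates."""
    # findtext returns None when no matching tag exists
    token = token_element.findtext("tags/tag[@type='tok']")
    pos = token_element.findtext("tags/tag[@type='pos']")
    return token, pos

# Illustrative token element (values made up for the example)
toy_token = ET.fromstring(
    "<tagtoken><tags>"
    "<tag type='tok'>not</tag><tag type='pos'>RB</tag>"
    "</tags></tagtoken>"
)
print(get_token_pos_by_path(toy_token))
```

Whether the loop or the predicate reads better is a matter of taste; both inspect the same `tag` elements.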


### 2.c Get document text

Define a function with the following requirements:

* Positional parameter: filepath of a document in a particular language (i.e. the full or relative filepath)
* Output: the text of the document as a string

**Hint**:
 
There are two options to get the document text of a file:

* Option 1: Access the comment indicated by `<!--  -->`. Look at the file above to find the comment. You will see that it contains the entire text represented in the xml file. You can access it by iterating over the child-elements of the root. Try this out in the notebook before defining your function. You can get started using the code below.


* Option 2: Use the tokens. You can collect all the tokens in a document using the functions we have defined above. Once you have all tokens, you can use a string method to join them with whitespaces between them.

Only implement **one** of these options. 

In [81]:
# Code snippet for option 1 

# use the test document (test_doc_path already points to en.drs.xml)
test_doc_path_en = test_doc_path
# load
doc_tree = etree.parse(test_doc_path_en)
# get root 
root = doc_tree.getroot()
# iterate over child-elements (iterating the element directly replaces the deprecated getchildren())
for ch in root:
    print('tag', ch.tag)
    print('text', ch.text)

tag <cyfunction Comment at 0x000001DF880EF520>
text  I 'm not tired at~all . 
tag xdrs
text 
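
Building on the snippet above, option 1 comes down to picking the comment child and stripping the surrounding whitespace. The sketch below shows the idea on an inline string using the standard library's `xml.etree.ElementTree` (an assumption to keep it installation-free). Unlike `lxml`, the standard library drops comments unless the parser is told to keep them, which needs Python 3.8 or later; with `lxml`, the comparison would be against `etree.Comment` instead.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for a PMB file: a comment with the text, then the xdrs element
toy_xml = "<xdrs-output><!-- I 'm not tired at~all . --><xdrs/></xdrs-output>"

# Keep comments in the tree (the standard-library parser discards them by default).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
toy_root = ET.fromstring(toy_xml, parser=parser)

doc_text = None
for child in toy_root:
    # Comment nodes carry the factory function ET.Comment as their tag.
    if child.tag is ET.Comment:
        doc_text = child.text.strip()
        break

print(doc_text)
```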
 


In [13]:
def get_doc_text(path_to_doc):
    """Return the document text as a string of whitespace-separated tokens (option 2)."""
    doc_tree = etree.parse(path_to_doc)
    root = doc_tree.getroot()
    tags = root.findall("xdrs/taggedtokens/tagtoken/tags/tag")
    # keep only the 'tok' tags; they hold the surface form of each token
    token_list = []
    for tag in tags:
        if tag.get("type") == "tok":
            token_list.append(tag.text)
    return " ".join(token_list)

# Test your function
test_part = 'p27'
test_document = 'd0852'
language = 'en'
test_doc_path = f'{path_pmb}/{test_part}/{test_document}/{language}.drs.xml'

text = get_doc_text(test_doc_path)
print(text)

assert text == "I 'm not tired at~all .", 'doc text not correct'

I 'm not tired at~all .


### 2.d Sort documents on languages 

To explore the coverage of the corpus, it is convenient to store the documents as sets mapped to their language. We can then use set methods (i.e. intersection) to check which documents exist in a given language pair. 

Write a function which fulfills the following criteria:

* mandatory positional argument: path to the corpus (e.g. '../../../Data/PMB/pmb-2.1.0/data/gold')
* output: a dictionary of the following format:
            `{
              'language1': {'path_to_doc1', 'path_to_doc2', ...},
              'language2': {'path_to_doc1', 'path_to_doc4', ...},
              'language3': {'path_to_doc2', 'path_to_doc3', ...},
              }`
       
       
Hints:

* filepaths are strings; you can manipulate them using string methods (e.g. split on '/' or '.'). 
* The os module has a convenient way of extracting the filename from a long path (i.e. the last bit of the path): os.path.basename(your_path)
* Feel free to use [defaultdict](https://docs.python.org/3/library/collections.html#collections.defaultdict) (with a set as the default value) (`from collections import defaultdict`)
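
The `defaultdict` hint can be tried on a handful of made-up paths before touching the corpus. In this sketch, the paths in `toy_paths` are hypothetical and only mimic the corpus layout.

```python
import os
from collections import defaultdict

# Hypothetical file paths following the layout part/document/language.drs.xml
toy_paths = [
    "gold/p00/d0712/de.drs.xml",
    "gold/p00/d0712/en.drs.xml",
    "gold/p27/d0852/en.drs.xml",
]

toy_language_docs = defaultdict(set)
for path in toy_paths:
    language = os.path.basename(path).split(".")[0]
    # a missing key starts out as an empty set, so no existence check is needed
    toy_language_docs[language].add(os.path.dirname(path))

print(dict(toy_language_docs))
```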

In [32]:
# Example for filepath manipulation:
import os 

# raw string, so the backslashes are not read as escape sequences
my_path = r'/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold\p00/d0712\de.drs.box'
f_name = os.path.basename(my_path)
print(f_name)
print(type(f_name))
# slice the filename off the end; rstrip() would strip a set of characters, not a suffix
remaining_path = my_path[:-len(f_name)]
print(remaining_path)
name = f_name.split('.')
print(name[0])

de.drs.box
<class 'str'>
/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold\p00/d0712\
de
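
The same pieces can also be obtained with `pathlib`. This is an optional alternative, not something the assignment requires; the sketch uses `PurePosixPath` and a made-up relative path so that it behaves the same on every operating system.

```python
from pathlib import PurePosixPath

toy_path = PurePosixPath("gold/p00/d0712/de.drs.xml")  # illustrative path

file_name = toy_path.name               # last path component
doc_dir = str(toy_path.parent)          # everything before the filename
language = file_name.split(".")[0]      # part before the first dot

print(file_name, doc_dir, language)
```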


In [23]:

import glob
import os

def sort_documents(path_pmb):
    """Map each language code to the set of document directories that contain it."""
    languages = ("en", "it", "de", "nl")
    language_doc_dict = {language: set() for language in languages}
    # corpus layout: <path_pmb>/<part>/<document>/<language>.<suffixes>
    for part_dir in glob.glob(f"{path_pmb}/*"):
        for doc_dir in glob.glob(f"{part_dir}/*"):
            for file_path in glob.glob(f"{doc_dir}/*"):
                # the language code is the first dot-separated part of the filename
                language = os.path.basename(file_path).split(".")[0]
                if language in language_doc_dict:
                    # normalize Windows separators so that paths are comparable
                    language_doc_dict[language].add(doc_dir.replace("\\", "/"))
    return language_doc_dict


# Test your function:
my_path = "/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/"
path_pmb = f"{my_path}PMB/pmb-2.1.0/data/gold"
language_doc_dict = sort_documents(path_pmb)
print(language_doc_dict)

#n_en = len(language_doc_dict['en'])
#n_it = len(language_doc_dict['it'])
#n_de = len(language_doc_dict['de'])
#n_nl = len(language_doc_dict['nl'])
#print(n_en)
#print(n_it)
#print(n_de)
#print(n_nl)

#assert n_en == 4555 and n_it == 635 and n_de == 1175 and n_nl == 586, 'sorting not correct'

{'en': {'/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p31/d1675', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p15/d2078', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p00/d1503', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p07/d2761', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p08/d3277', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p00/d2170', '/Users/elena/OneDrive/Desktop/python-for-text-analysis-master/python-for-text-analysis-master/Data/PMB/pmb-2.1.0/data/gold/p12/d0922', '/Users/elena/OneDrive/Desktop/python-fo

## 3. Putting the tool together

Congratulations! You've written most of the code already! 

From now on, we will mostly use the functions defined above and combine them in the file called `explore_pmb.py`. 

Some code snippets are provided here to help you along the way. 

### 3.a Printing statistics for all languages

Let's start by exploring the coverage of all languages individually. In `explore_pmb.py`, write the following code:

* Import the function `sort_documents`, call it and assign the output dictionary to a variable called `language_doc_dict`. Don't forget to define the path to the corpus, which you use as function input. 
* Create a list of all languages (hint: you can simply use the keys of `language_doc_dict`). 
* For each language, print the following:
    `[Language]: num docs: [number of documents], num tokens: [number of tokens]`
    
Hints:

* Each document is unique - you can simply count the elements in the sets to get the number of documents.
* Use the function `get_tokens` to access the tokens. Then count them.
* You will most likely use two nested loops: An outer loop for languages and an inner loop to access the tokens in the documents. 
* Use f-strings to print the results.
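
The counting pattern behind these hints can be tried on toy data first. In the sketch below, `toy_docs` and `toy_tokens` are made-up stand-ins: in your script, the sets come from `language_doc_dict` and the token lists from `get_tokens`.

```python
# Made-up stand-ins for language_doc_dict and for the output of get_tokens
toy_docs = {"en": {"d1", "d2"}, "nl": {"d1"}}
toy_tokens = {
    ("en", "d1"): ["I", "'m", "tired", "."],
    ("en", "d2"): ["Hello", "!"],
    ("nl", "d1"): ["Ik", "ben", "moe", "."],
}

toy_stats = {}
for language, docs in toy_docs.items():
    n_docs = len(docs)   # each document appears once in the set
    n_tokens = 0
    for doc in docs:
        # add up the tokens of every document in this language
        n_tokens += len(toy_tokens[(language, doc)])
    toy_stats[language] = (n_docs, n_tokens)
    print(f"{language}: num docs: {n_docs}, num tokens: {n_tokens}")
```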


In [None]:
# some code to help you along the way (you can also start from scratch)
languages = # your code here

for language in languages:
    n_docs = # your code here
    n_tokens = # your code here
    docs = # your code here
    # your code here
    for doc in docs:
        path_to_doc = f'{doc}/{language}.drs.xml'
        tokens = get_tokens(path_to_doc)
        # your code here
    print(f'{language}: num docs: {n_docs}, num tokens: {n_tokens}')
        

### 3.b Printing statistics for language pairs 

We also want to explore the coverage of **parallel data** for the language pairs in the corpus. To do this, use an additional loop to iterate over all possible language pairs in the Parallel Meaning Bank and print the number of documents which exist for both languages. 

Use the function below to generate the language pairs. Simply copy-paste it into the script called `utils.py` and import it into `explore_pmb.py`. Use the cell below to explore how it works. 

The list of language pairs should look similar to this (and contain all possible pairs):

`pairs = [('nl', 'en'), ('it', 'de'), ('en', 'it')]`

Print the following for each language pair (use f-strings):

`Coverage for parallel data in [language_1] and [language_2]: [number of documents]`


Hints:

* You can unpack tuples in a loop.
* Use a set method to get the document paths covered by both languages. Then simply count the paths.
* You do not have to match the file contents. Instead, use the information provided in the filepaths (in a previous step, you have sorted your corpus files according to language). The paths in the sets (representing the documents) are supposed to point to the document directories only (i.e. without the language-specific filename), so that the same document has the same path in every language. You can use set operations to get the overlap between two languages. 
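
The set operation mentioned in these hints can be sketched on toy data as follows. The dictionary `toy_language_docs` is made up and only mimics the format of `language_doc_dict`.

```python
# Made-up document sets in the format returned by the sorting function
toy_language_docs = {
    "en": {"gold/p00/d0712", "gold/p27/d0852", "gold/p31/d1675"},
    "nl": {"gold/p27/d0852", "gold/p31/d1675"},
}

# documents present in both languages
shared_docs = toy_language_docs["en"] & toy_language_docs["nl"]
# equivalent: toy_language_docs["en"].intersection(toy_language_docs["nl"])

print(f"Coverage for parallel data in en and nl: {len(shared_docs)}")
```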


In [None]:
def get_pairs(language_list):
    """Given a list, return a list of tuples of all element pairs."""
    pairs = []
    for l1 in language_list:
        for l2 in language_list:
            if l1 != l2:
                if (l1, l2) not in pairs and (l2, l1) not in pairs:
                    pairs.append((l1, l2))
    return pairs

language_list = ['a', 'b', 'c']
pairs = get_pairs(language_list)
print(pairs)
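
For comparison, the standard library offers `itertools.combinations`, which yields each unordered pair once. For a list without duplicate entries, the sketch below should produce the same pairs as `get_pairs`; it is shown only to illustrate what the function does, and the assignment still asks you to use `get_pairs`.

```python
from itertools import combinations

language_list = ['a', 'b', 'c']
# every unordered pair of distinct elements, in order of first appearance
pairs_alt = list(combinations(language_list, 2))
print(pairs_alt)
```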

In [None]:
# Here's a start (feel free to modify this)

for lang1, lang2 in pairs:
    docs_lang1 = language_doc_dict[lang1]
    docs_lang2 = language_doc_dict[lang2]
    # your code here

### 3.c Explore parallel documents 

As a final step, we want to let the user browse the parallel documents in a chosen language pair. Write the following code (in `explore_pmb.py`):

* use input() to define two variables: language_1 and language_2
* get the document paths for all documents covered by both languages
* Loop over the documents and print the documents in both languages. After each document, ask the user whether they want to continue. If the answer is 'no', stop. Else, show the next. 
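
The stop-or-continue loop can be sketched on toy data as shown below. The function name `browse` and its `ask` parameter are hypothetical: `ask` defaults to `input` and is swapped for scripted answers here so that the loop can be tried without typing. The texts are made up; in your script they would come from `get_doc_text`.

```python
def browse(doc_texts, ask=input):
    """Show (text_1, text_2) pairs one by one until the user answers 'no'."""
    n_shown = 0
    for text_1, text_2 in doc_texts:
        print(text_1)
        print(text_2)
        n_shown += 1
        # any answer other than 'no' moves on to the next document
        if ask("Continue? (yes/no) ").strip().lower() == "no":
            break
    return n_shown

# Made-up parallel texts and scripted answers instead of keyboard input
toy_texts = [
    ("I 'm not tired .", "Ik ben niet moe ."),
    ("Hello !", "Hallo !"),
    ("Thanks .", "Bedankt ."),
]
answers = iter(["yes", "no"])
n_shown = browse(toy_texts, ask=lambda prompt: next(answers))
print(n_shown)
```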


### Bonus: Come up with your own comparison

Got interested in parallel data? Extract a comparison you find interesting! 

**This is an additional exercise - it is not required to complete this to get a full score.** 

If you complete this exercise, you can earn up to 3 additional points which can be used to make up for other points you missed. Note that you cannot get more than a full score. 

## 4. Testing & submission

Congratulations! You've built a corpus exploration tool! Before you submit, please make sure to test your code:

* Can you run the script `explore_pmb.py` from the command line?
* Do you get a general corpus overview (see 3.a)?
* Do you get an overview of language pairs (see 3.b)?
* Are you asked to provide a language pair and do you see examples of parallel documents once you have entered a pair (see 3.c)?

If you did not manage to complete all of the exercises, submit what you have and, if possible, explain how you were going to solve them. You get points for intermediate steps. 

**Please submit python scripts only. You can use this notebook for exploration and development, but we will not consider the code written here.**
