# Exercise Session 3: Alignment and Querying Systems

## Alignment

In this part of the session, we align the horizons corpus and store it in a format similar to the one in which [OPUS](http://opus.nlpl.eu/) makes its parallel corpora available.

```xml
<document src="de" trg="en">
    <sent_pair id="0">
        <src lang="de">Der Fuchs spricht in fremden Zungen.</src>
        <trg lang="en">The fox speaks with strange tongues.</trg>
    </sent_pair>
</document>
```

If we use this format for the English and the German part of the horizons corpus, we gain some information, but we also lose some. Textual coherence that spans multiple sentences is not necessarily preserved. We also have to assume that the articles contain more than just 1:1 relationships between German and English sentences: a German sentence may be paraphrased by two or more English sentences. This format cannot represent such 1:n or n:m relationships. Nonetheless it is useful, for example as a test set for an MT application or as the basis for a statistically computed dictionary.

### Aligning

There are different alignment tasks, namely document, sentence and word alignment. Each of them requires somewhat different algorithms and techniques. Today we focus only on sentence alignment. The document alignment was already done when we crawled the web for the corpus: we were able to use the semantic links between the articles to identify corresponding documents.

There are different approaches to sentence alignment:

- Length-based: [Gale-and-Church-Algorithm](https://www.nltk.org/_modules/nltk/translate/gale_church.html)
- Length-and-Dictionary-based: [hunalign](https://github.com/danielvarga/hunalign)
- MT-based: [Bleualign](https://github.com/rsennrich/Bleualign)
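
To give an intuition for the length-based idea, here is a much simplified sketch. It is not the Gale-Church algorithm itself, which uses a probabilistic cost model. Here the cost of a bead is simply the absolute difference in characters, and `skip_cost` as well as the function name `align_by_length` are arbitrary choices made for this illustration.

```python
def align_by_length(src, trg, skip_cost=50):
    """Align two lists of sentences by character length using dynamic programming.

    Returns a list of (source indices, target indices) beads.
    """
    # Allowed bead shapes: (number of source sentences, number of target sentences)
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

    def cost(s_chunk, t_chunk):
        if not s_chunk or not t_chunk:
            return skip_cost  # flat penalty for leaving a sentence unaligned
        return abs(sum(map(len, s_chunk)) - sum(map(len, t_chunk)))

    n, m = len(src), len(trg)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]  # cheapest cost to reach (i, j)
    back = [[None] * (m + 1) for _ in range(n + 1)]  # predecessor of (i, j)
    best[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(src[i:ni], trg[j:nj])
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Follow the back pointers from the end to recover the beads
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return alignment[::-1]
```

Because 2:1 and 1:2 beads are allowed, two short sentences on one side can be grouped with one long sentence on the other side, which a purely 1:1 alignment could not express.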

#### Bleualign

We are going to use Bleualign in this session. Bleualign takes as input the two texts we want to align, as well as a translation of one of the texts that we produce ourselves. It then uses the similarity between this secondary text and the text in the same language to align the two primary texts. The great thing is that this secondary translation doesn't have to be particularly good, it just has to be alright. Therefore, we can easily have an MT tool do this task.

<img src="img/bleualign.png" alt="bleualign" style="width: 500px;"/>

The metric it uses to determine the similarity between the translation and the text in the same language is BLEU, a score widely used in MT to evaluate the quality of an automatic translation.
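
The matching idea can be sketched in a few lines. This is only an illustration and not how Bleualign is implemented: a crude word-overlap score stands in for BLEU, every sentence is matched greedily and independently, and the names `overlap`, `best_matches` and the `threshold` value are assumptions made for this example.

```python
from collections import Counter

def overlap(candidate, reference):
    """Share of candidate tokens that also occur in the reference (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

def best_matches(translated_src, trg, threshold=0.5):
    """For each translated source sentence, pick the most similar target sentence."""
    if not trg:
        return []
    matches = []
    for i, sent in enumerate(translated_src):
        scores = [overlap(sent, t) for t in trg]
        j = max(range(len(trg)), key=scores.__getitem__)
        if scores[j] >= threshold:  # ignore weak matches
            matches.append((i, j))
    return matches

translated = ["the fox speaks in foreign tongues .", "he is clever ."]
english = ["He is very clever .", "The fox speaks with strange tongues ."]
print(best_matches(translated, english))
```

Each resulting pair `(i, j)` says that source sentence `i` probably corresponds to target sentence `j`, even though the machine translation differs from the human one.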

The core idea behind it is to compare a Candidate, an automatic translation, to one or more Reference translations. The following formula is a simplified BLEU score that only takes unigrams and a single Reference sentence into account:

$$
BLEU_{simple} = \frac{\sum_{\text{Unigram} \in \text{Candidate}}^{}\text{matched(Unigram)}}{\text{Length(Candidate)}}
$$

Let's use an example:

| |t1|t2|t3|t4|t5|t6|
|-|--|--|--|--|--|--|
|Candidate|the|cat|runs|from|the|dog|
|Reference|the|cat|flees|from|a|dog|
|Reference 2|the|cat|-|-|-|-|




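$$
\begin{aligned}
\text{matched(the)} &= 1\\
\text{matched(cat)} &= 1\\
\text{matched(runs)} &= 0\\
\text{matched(from)} &= 1\\
\text{matched(dog)} &= 1\\
BLEU_{simple} &= \frac{4}{6} \approx 0.67
\end{aligned}
$$

Note that "the" occurs twice in the Candidate but only once in the Reference, so it is counted as matched only once.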
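
The same computation can be written as a short function. This is a minimal sketch of the simplified score above, not of full BLEU; the name `simple_bleu` is chosen for this example only.

```python
from collections import Counter

def simple_bleu(candidate, reference):
    """Clipped unigram precision of a candidate against a single reference."""
    if not candidate:
        return 0.0
    remaining = Counter(reference)  # how often each reference token may still be matched
    matched = 0
    for token in candidate:
        if remaining[token] > 0:
            matched += 1
            remaining[token] -= 1
    return matched / len(candidate)

candidate = "the cat runs from the dog".split()
reference = "the cat flees from a dog".split()
print(round(simple_bleu(candidate, reference), 2))  # 4 of 6 tokens match -> 0.67
```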

Now, you might see a shortcoming of this metric. If the Candidate is very short, maybe only a single token, it can get a very good score, because the metric is normalized by the Candidate's length. For example, if our Candidate consisted only of "cat", it would get a perfect score, even though it is far from the Reference.

That is why the BLEU score additionally introduces a so-called *Brevity Penalty*. If a Candidate is shorter than the Reference, its score is scaled down. This might be a rather crude approach, but it has proven to match our expectations of good translations rather well.
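
As a sketch, this is the penalty as it is defined for standard BLEU: it is 1 if the Candidate is at least as long as the Reference, and `exp(1 - r/c)` otherwise, where `c` and `r` are the Candidate and Reference lengths. The function name is chosen for this example, and the sketch makes no claim about how Bleualign applies the penalty internally.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Factor between 0 and 1 that scales down the score of short candidates."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0  # no penalty for candidates that are long enough
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(6, 6))  # same length as the Reference: no penalty
print(brevity_penalty(1, 6))  # the one-token Candidate "cat" is penalized heavily
```

Multiplied with the unigram score, the one-token Candidate "cat" no longer gets a perfect result, because its factor is `exp(-5)`, which is below 0.01.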

### Doing the alignment

For now, I only want to align five articles as a proof-of-concept.

First, we need to get the articles into the format we need. Beforehand, I cut out five articles from the corpus, which looks like this:

```xml
<corpus>
    <article id="a1" issue="120" lang="de">
        <div class="title">Titel</div>
        <div class="abstract">Einführung</div>
        <div>Satz 1. Satz 2. Satz 3.</div>
        <div>Satz 4. Satz 5.</div>
    </article>
</corpus>
```

Bleualign expects each document to have an id as its filename and a suffix that corresponds to its role. For example, the three files needed to create an alignment could be called `0.en`, `0.de` and `0.trans`. Additionally, Bleualign expects one sentence per line.
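
Before running the aligner, it can be handy to check which articles already have all three files. The helper below is a small sketch for that purpose; `complete_ids` is a hypothetical name and the default suffixes are simply the ones from the example above.

```python
from pathlib import Path

def complete_ids(directory, suffixes=("de", "en", "trans")):
    """Return the sorted article ids for which a file exists for every suffix."""
    directory = Path(directory)
    ids_per_suffix = [
        {path.stem for path in directory.glob(f"*.{suffix}")}
        for suffix in suffixes
    ]
    # Only ids that occur with every suffix are complete
    return sorted(set.intersection(*ids_per_suffix))
```

Called as `complete_ids("alignment_dir")`, it lists the ids that are ready for alignment, while articles that still lack a translation are left out.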

First I define a function that lets me print a directory tree, so I don't need to switch editors to check.

In [1]:
import os

def list_files(startpath):
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

Now the real deal:

In [2]:
from pathlib import Path
from lxml import etree
import nltk
import re

# Path("/my/directory").mkdir(parents=True, exist_ok=True)

def my_parse(lang):
    # NLTK names its tokenizer models after the full language name
    languages = {"de": "german", "en": "english"}
    filename = Path(f"corpus_cutouts/{lang}_corpus_cutout.xml")

    Path("./alignment_dir").mkdir(parents=True, exist_ok=True)
    # Iterative parsing, great for XML files that don't fit into RAM
    for _, article in etree.iterparse(str(filename), tag="article"):
        art_id = article.get("id")

        # Drop the leading "a" of the id, e.g. "a1" becomes the filename "1.de"
        with open(f"alignment_dir/{art_id[1:]}.{lang}", "w", encoding="utf8") as outfile:
            for elem in article.xpath("./div[@class='title']|./div[@class='abstract']|./div[not(@class)]"):
                if not elem.text:
                    continue  # skip divs without text

                # NLTK's sentence segmenter works better than spaCy's if there are quotation marks in the text
                sentences = nltk.sent_tokenize(elem.text, language=languages[lang])

                for sent in sentences:
                    # A full stop followed by a closing quote also ends a sentence.
                    # The lookbehind keeps ".»" attached to the sentence it closes.
                    for part in re.split(r"(?<=\.»)\s*", sent):
                        if part == "":
                            continue

                        tokenized_sent = nltk.word_tokenize(part, language=languages[lang])
                        outfile.write(f"{' '.join(tokenized_sent)}\n")
        # return here to only parse one article; remove it to parse all of them
        return
    print("Finished")

In [3]:
my_parse("de")

In [4]:
my_parse("en")

In [5]:
list_files(".")

./
    Exercise Session 3.ipynb
    .ipynb_checkpoints/
        Exercise Session 3-checkpoint.ipynb
    alignment_dir/
        160.de
        160.en
    corpus_cutouts/
        de_corpus_cutout.xml
        en_corpus_cutout.xml
    img/
        bleualign.png


Now, we can start to translate the texts of one of the languages. For this, I am using the [googletrans](https://pypi.org/project/googletrans/)-module, which serves as a Python-interface for Google Translate.

In [6]:
from googletrans import Translator
import translate

def obtain_translation(artid):
    inpath = Path(f"alignment_dir/{artid}.de")
    outpath = Path(f"alignment_dir/{artid}.trans")

    translator = Translator()

    with open(inpath, "r", encoding="utf8") as infile:
        with open(outpath, "w", encoding="utf8") as outfile:
            for line in infile:
                # Strip the trailing newline so that it is not sent to the translator
                translation = translator.translate(line.strip(), src="de", dest="en").text
                outfile.write(f"{translation}\n")

    print(f"Finished article {artid}")


# Alternative that uses the translate package instead of googletrans
def no_google_obtain_translation(artid):
    inpath = Path(f"alignment_dir/{artid}.de")
    outpath = Path(f"alignment_dir/{artid}.trans")

    with open(inpath, "r", encoding="utf8") as infile:
        with open(outpath, "w", encoding="utf8") as outfile:
            translator = translate.Translator(to_lang="en", from_lang="de")
            for line in infile:
                trans_line = translator.translate(line.strip())
                outfile.write(trans_line + "\n")

    print(f"Finished article {artid}")

Be aware that the Google Translate API has restrictions on how many lines you are allowed to translate. I found that I can only run this script around once per day. Otherwise, it returns a `JSONDecodeError`. For larger corpora, you either need to do some VPN-magic, pay for a translation service or use something local (or on the CL-server). If this isn't in your power, an aligner that does not need a translation, such as the length-based Gale-Church algorithm or hunalign, is probably the better choice.
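
Because of such limits, it helps if an interrupted run can be resumed without translating everything again. The following sketch skips articles that already have a `.trans` file and only keeps translations that were written completely. `translate_missing` and its parameter `translate_line` (any function that maps one source line to its translation) are hypothetical names for this example.

```python
from pathlib import Path

def translate_missing(directory, translate_line, src_suffix="de", out_suffix="trans"):
    """Translate only the articles that do not have a translation file yet."""
    done = []
    for inpath in sorted(Path(directory).glob(f"*.{src_suffix}")):
        outpath = inpath.with_suffix(f".{out_suffix}")
        if outpath.exists():
            continue  # already translated in an earlier run
        tmppath = inpath.with_suffix(".tmp")
        with open(inpath, encoding="utf8") as infile, \
                open(tmppath, "w", encoding="utf8") as outfile:
            for line in infile:
                outfile.write(translate_line(line.strip()) + "\n")
        # Rename only after the whole article was translated without an error
        tmppath.replace(outpath)
        done.append(inpath.stem)
    return done
```

If the translation service raises an error in the middle of an article, no `.trans` file is created for it, so the next run picks up exactly where the previous one stopped.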

In [7]:
#for i in range(160,165):
#    obtain_translation(str(i))

obtain_translation(160)

Finished article 160


The next step is using Bleualign. Its functionality could be imported into your own script, but it's easier to just use it on the command line.

I am going to use the following call, which enables batch-processing:

`python path/to/batch_align.py alignment_dir de en trans`

This creates two new files per document pair, which contain only the matching sentence pairs. Sentences that could not be matched are discarded. For some use cases this might not be desired, but for our goals it's just what we need.

In [None]:
list_files("alignment_dir")

In these `.aligned` files, the lines correspond to each other: the sentence on line 36 of `163.de.aligned` should be the translation of the sentence on the same line of `163.en.aligned`.

### Creating the final format

First, I create a generator over the sentence pairs. This makes them a lot easier to handle later.

In [8]:
def pair_iterator(directory, artid):
    with open(Path(f"{directory}/{artid}.de.aligned"), "r", encoding="utf8") as defile:
        with open(Path(f"{directory}/{artid}.en.aligned"), "r", encoding="utf8") as enfile:
            for de, en in zip(defile, enfile):
                yield (de.strip(), en.strip())
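
Note that `zip` stops silently at the end of the shorter file. As a precaution, one could check beforehand that both `.aligned` files have the same number of lines. This is a small sketch with hypothetical helper names, assuming the file naming described above.

```python
from pathlib import Path

def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, encoding="utf8") as f:
        return sum(1 for _ in f)

def check_aligned(directory, artid, src="de", trg="en"):
    """Return the number of sentence pairs, or raise if the files differ in length."""
    n_src = count_lines(Path(directory) / f"{artid}.{src}.aligned")
    n_trg = count_lines(Path(directory) / f"{artid}.{trg}.aligned")
    if n_src != n_trg:
        raise ValueError(
            f"Article {artid}: {n_src} {src} lines but {n_trg} {trg} lines"
        )
    return n_src
```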

For this XML file, I am going to use the `xmlfile` class of lxml. Before, we always built the XML tree we wanted completely in memory before printing it to a file. This works well for smaller amounts of data, but can get us in trouble if our RAM is not big enough. `xmlfile` allows us to build *and write* our tree continuously. This means that after we've processed one sentence pair, we can forget about it, because it's already written to the file.

Of course, this is once again not necessary for our example, but if we had thousands of articles, this technique would prove useful.

In [10]:
def create_opus_xml(source_dir, target, start, end):
    with etree.xmlfile(target, encoding="utf-8") as xf:
        with xf.element("document", {"src":"de","trg":"en"}):

            id_counter = 0

            for i in range(start, end+1):

                for de, en in pair_iterator(source_dir, str(i)):
                    sent_pair = etree.Element("sent_pair")
                    sent_pair.attrib["id"] = str(id_counter)

                    # Element names follow the format shown at the beginning: <src> and <trg>
                    src_elem = etree.SubElement(sent_pair, "src")
                    src_elem.attrib["lang"] = "de"
                    src_elem.text = de

                    # Named trg_elem so that it does not shadow the `target` parameter
                    trg_elem = etree.SubElement(sent_pair, "trg")
                    trg_elem.attrib["lang"] = "en"
                    trg_elem.text = en

                    # The pair is written to the file immediately
                    xf.write(sent_pair)

                    id_counter += 1

                    
create_opus_xml("alignment_dir", "opus.xml", 160, 160)
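
The resulting file can be read back in the same streaming fashion. The sketch below uses `iterparse` from the standard library rather than lxml, and it only assumes that every `sent_pair` holds the source sentence as its first child and the target sentence as its second. `iter_pairs` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

def iter_pairs(path):
    """Yield (id, source sentence, target sentence) for every <sent_pair> in the file."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "sent_pair":
            # The first child holds the source sentence, the second the target sentence
            src, trg = list(elem)[:2]
            yield elem.get("id"), src.text, trg.text
            elem.clear()  # drop the children and text we no longer need
```

A loop such as `for pair_id, de, en in iter_pairs("opus.xml"): ...` then only has to deal with one sentence pair at a time.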

## Querying Corpora

A corpus is not defined by its format. Therefore, a multitude of formats exists, along with many ways to query them.

We have already gotten to know the three-column verticalized format in exercise session 1, which we could query by using UNIX tools.
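
The same kind of query can also be written in Python. The sketch below assumes a tab-separated layout with one token per line and the columns token, part-of-speech tag and lemma. This column order is an assumption made for the example and may differ from the files used in session 1, and `lemma_frequencies` is a hypothetical helper.

```python
from collections import Counter

def lemma_frequencies(lines, pos_prefix="N"):
    """Count the lemmas of all tokens whose POS tag starts with pos_prefix."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and structural lines such as <s>
        token, pos, lemma = parts
        if pos.startswith(pos_prefix):
            counts[lemma] += 1
    return counts

sample = [
    "Der\tART\tder",
    "Fuchs\tNN\tFuchs",
    "spricht\tVVFIN\tsprechen",
    "<s>",
    "Füchse\tNN\tFuchs",
]
print(lemma_frequencies(sample).most_common(1))
```

Since the function accepts any iterable of lines, an open file handle can be passed in directly, so large corpora do not have to be loaded into memory.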

I want to present two additional ways to query corpora:

- [Corpus Workbench](https://cqpweb.lancs.ac.uk/): The Corpus Workbench is a tool for corpus linguists. It can process large corpora, is rather fast and supports the most common techniques of corpus linguistics, such as the analysis of collocations, keywords and distributions.
    - [Language and Encoding Documentation](http://cwb.sourceforge.net/documentation.php)
    
    
- [multilingwis2](https://pub.cl.uzh.ch/projects/sparcling/multilingwis2.demo/): Multilingwis2 is a tool developed here in Zurich. It is an exploratory querying tool for parallel (and *multi*parallel) corpora.
    - [The corpora behind](https://pub.cl.uzh.ch/wiki/public/pacoco/start)

They both offer a rather high-level interface for interacting with corpora. The complexity of the underlying data is abstracted away, so the data can be navigated more easily, even by people without a computational background. Of course, this rules out some forms of querying, but this trade-off between ease of use and expressiveness is always present.