# PED Writer Notebook
In this notebook, we use our insights about error patterns in post-edited data to further improve MT output. Check out the [PED Reader notebook](https://github.com/SeeligA/ped_reader/blob/master/ped_reader_nb.ipynb "PED Reader Notebook") to see how we arrived at this point. Since most translation jobs are based on, or are at least compatible with, the XLIFF exchange format, we will apply our rules to these files.

**Prerequisites:**
* Previous knowledge of *Python* is not strictly required, but certainly helpful.
* If you know your way around the *command line* and different flavours of *Regular Expressions*, you should be fine.


## 1. Notes on XLIFF files
If you are reading this, you are probably a localization engineer trained in the usage of *XML* and *XLIFF*. If you haven't worked with XLIFF files before, here are some important things to know:
* XLIFF is an XML-based standard developed by OASIS. The 2.x series is more recent, but for our purposes [1.2](http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html "XLIFF 1.2 Specification") will do.
* The header stores information about the source document(s) as well as definitions for formatting and non-translatable tokens. Some implementations also embed the source document to support live previews.
* The body stores so-called **translation units** (TUs), which are sentence-length or paragraph-length tokens of the original text. TUs are usually bilingual parent elements containing a source and a target child (called segments), along with various useful metadata about the segments themselves.
* Last but not least, segments often contain **inline elements** to represent formatting or non-translatables. These elements can be a bit tricky for MT, because their position often depends on linguistic context.
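
To make the structure described above more concrete, here is a minimal sketch that reads a hand-written XLIFF 1.2 fragment with Python's standard `xml.etree.ElementTree`. The sample markup and the helper name `segment_text` are illustrative assumptions; they are not taken from the sample files or from the `source.xliff` module, and real-world files (SDLXLIFF in particular) carry considerably more metadata.

```python
import xml.etree.ElementTree as ET

# Hand-written XLIFF 1.2 fragment (illustrative, not from the sample files)
XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.docx" source-language="en" target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Save <g id="5">50%</g> today</source>
        <target>Ahorre 50% hoy</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def segment_text(element):
    """Return the plain text of a segment, ignoring inline elements."""
    return "".join(element.itertext())


root = ET.fromstring(XLIFF)
for tu in root.iterfind(".//x:trans-unit", NS):
    source = tu.find("x:source", NS)
    target = tu.find("x:target", NS)
    # Inline elements are direct children of the segment; strip the namespace
    inline = [child.tag.split("}")[1] for child in source]
    print(tu.get("id"), segment_text(source), "|", segment_text(target), inline)
```

Note that the source segment carries an inline `<g>` element while the target does not, which is exactly the kind of mismatch addressed later in this notebook.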

In order to extract, process and write XML data, we import a few new functions in addition to the Entry and Substitution modules introduced earlier.

In [1]:
# The following five lines are for the creator of this notebook. Please ignore.
#%load_ext autoreload
#%autoreload 2
#import logging
#logger = logging.getLogger()
#logger.setLevel(logging.INFO)

from source.xliff import create_tree, print_sample_from_file, print_sample
from source.subs import PreprocSub
from source.entries import SearchMTEntry, SearchSourceEntry, ToggleCaseEntry, ApplyTagEntry
from source.utils import unzip_sample, retrieve_file_paths

from ipywidgets import interactive
import ipywidgets as widgets

## 2. Reading and pretty printing
Next, let's see what kind of data we are dealing with. Running the next cell will unzip and read the sample files in the `data` folder and then print out a preview of a single translation unit. Use the slider to browse through the file.
To include your own XLIFF file, uncomment the line `fps.append(input('Path to input file: '))` and enter the path to your file when prompted.
```
tree, tus = create_tree(fp)
```
Note that the line above creates two new objects to help us parse string data:
- a Tree object representing and indexing all elements in the XML file.
- a list of translation unit nodes from that Tree, over which we will iterate during processing.
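
The relationship between the two objects can be sketched with the standard library. `create_tree_sketch` below is a hypothetical stand-in, not the actual `create_tree` implementation from `source.xliff`; it assumes plain XLIFF 1.2 with `trans-unit` elements and reads from an in-memory string instead of a file on disk.

```python
import io
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def create_tree_sketch(file):
    """Hypothetical stand-in for create_tree: parse once, collect the TU nodes."""
    tree = ET.parse(file)  # accepts a path or a file-like object
    tus = list(tree.getroot().iter(f"{{{XLIFF_NS}}}trans-unit"))
    return tree, tus


sample = io.StringIO(
    f'<xliff version="1.2" xmlns="{XLIFF_NS}"><file><body>'
    '<trans-unit id="1"><source>Hello</source><target>Hola</target></trans-unit>'
    '<trans-unit id="2"><source>World</source><target>Mundo</target></trans-unit>'
    '</body></file></xliff>'
)
tree, tus = create_tree_sketch(sample)

# The list holds references into the tree, so editing a node edits the tree
tus[0].find(f"{{{XLIFF_NS}}}target").text = "¡Hola!"
print(len(tus), ET.tostring(tree.getroot(), encoding="unicode"))
```

Because the list only references nodes that live in the tree, any change made while iterating over the translation units is picked up when the tree is written back to disk.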

In [2]:
dir_name = "data"
unzip_sample(dir_name)
fps = retrieve_file_paths(dir_name)

# Uncomment the following line to enable the input file prompt
#fps.append(input('Path to input file: '))

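def sample_widget(file):
    # Create a tree object and a list of translation units
    tree, tus = create_tree(file)

    def print_sample_widget(tu_id):
        # Print the translation unit selected with the slider
        print_sample(tus, tu_id)

    # print_sample_widget only takes tu_id, so the slider is the only control needed
    v = interactive(print_sample_widget,
                    tu_id=widgets.IntSlider(min=0, max=len(tus)-1, step=1, value=0, continuous_update=False))
    display(v)

# Build a dropdown over the available files; selecting a file rebuilds the slider
w = interactive(sample_widget, file=fps)
display(w)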

Sample extracted here: data
Backup containing 2 files created here: data


interactive(children=(Dropdown(description='file', options=('data\\S0123_en.docx.sdlxliff', 'data\\Wochenkarte…

## 3. Modifying string data and inserting elements
As you can see above, our string data is embedded in a hierarchical structure of elements. Instead of searching and replacing strings directly from a table, we will need to parse text from the `<seg-source>MyString</seg-source>` and `<target>MeinString</target>` elements first. There is an upside, too: it gives us flexibility to modify the string environment and how text is displayed in our CAT editor and the target document.

To illustrate this, I am introducing a new Entry object, which looks up inline formatting tags or placeholders in the source and applies the same formatting to the corresponding match in the target. This new entry **only works with XLIFF files; it does not work with the table data** from the previous notebook, because the table data does not include tags, formatting or otherwise.
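
The idea can be sketched on serialized segment strings. The function `apply_tag` below is a hypothetical illustration and not the actual `ApplyTagEntry` logic, which works on XML nodes and may differ in detail. It assumes that the first group of `source_filter` captures the whole tagged element and the second group captures the tag name, as in the patterns used later in this section.

```python
import re

SOURCE_FILTER = r"(<(\w+) [^>]+?>\d+%</\w+>)"  # tagged percentage in the source
SEARCH = r"(?<=[^>])\d+%(?=([^<]|$))"          # untagged percentage in the target


def apply_tag(source, target, source_filter, search, replace=None):
    """Hypothetical sketch: copy the tag found in `source` onto the match in `target`."""
    tagged = re.search(source_filter, source)
    if tagged is None:
        return target  # nothing to copy
    open_tag = re.match(r"<[^>]+>", tagged.group(1)).group(0)
    close_tag = f"</{tagged.group(2)}>"

    def wrap(match):
        # Keep the matched text unless an explicit replacement is given
        text = match.group(0) if replace is None else replace
        return f"{open_tag}{text}{close_tag}"

    return re.sub(search, wrap, target, count=1)


source = 'Save <g id="5">50%</g> today'
target = "Ahorre 50% hoy"
result = apply_tag(source, target, SOURCE_FILTER, SEARCH)
print(result)  # Ahorre <g id="5">50%</g> hoy
```

If the source contains no matching tag, the target is returned unchanged, so a rule of this kind cannot introduce formatting that the source does not have.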

Back to our sample file, it appears that some formatting tags are missing in the target. Let's fix this by wrapping some custom rules in a substitution object and applying it to the sample files in our directory…

In [8]:
# Patterns are written as raw strings so that \b, \d and \w reach the regex engine unchanged
x = ApplyTagEntry({"ID": 999, "s_lid": "All", "t_lid": "All", 'desc': "Format: color green 'percentage'",
                   "search": r"(?<=[^>])\d+\%(?=([^<]|$))", "replace": None,
                   "source_filter": r'(<(\w+) [^>]+?>\d+%</\w+>)'
                  })
y = ApplyTagEntry({"ID": 997, "s_lid": "EN", "t_lid": "ES", 'desc': "Format: italics 'and'",
                  "search": r"(?<=[^>])\by\b(?=([^<]|$))", "replace": "y",
                  "source_filter": r'(<(\w+) [^>]+?>and</\w+>)'
                  })

z = ApplyTagEntry({"ID": 998, "s_lid": "All", "t_lid": "All", 'desc': "Format: color red 'nums'",
                  "search": r"(?<=[^>])\d+(?=([^<]|$))", "replace": None,
                  "source_filter": r'(<(\w+) [^>]+?>\d+?</\w+>)'
                  })

subs_list = [z, y, x]
subs = PreprocSub(created_by="ASE", desc="Testing", entries=subs_list)

cache = subs.apply_to_working_files(fps)
display(w)

interactive(children=(Dropdown(description='file', options=('data\\S0123_en.docx.sdlxliff', 'data\\Wochenkarte…
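
A note on writing patterns such as the `search` value of entry `y`: in an ordinary Python string literal, `\b` is interpreted as the backspace character before the pattern ever reaches the regex engine, whereas in a raw string it stays a word boundary. The short check below uses only the standard `re` module; the sample sentence is made up for illustration.

```python
import re

plain = "\by\b"   # Python parses \b as the backspace character (\x08)
raw = r"\by\b"    # \b reaches the regex engine as a word boundary

text = "pan y agua"

print(repr(plain))              # '\x08y\x08'
print(re.search(plain, text))   # None
print(re.search(raw, text).span())
```

Writing regex patterns as raw strings also avoids the invalid-escape warnings that newer Python versions emit for sequences like `\d` and `\w` in ordinary string literals.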

## 4. Next Steps
Once you have found your rules to be working as expected, you can add them to any existing substitution items you might have on file.
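
Conceptually, adding rules to a stored substitution means appending to its `entries` list and writing the result back as JSON. The sketch below works on plain dictionaries shaped like the JSON printed in this section; it does not use `PreprocSub`, and the helper `merge_entries` with its duplicate-ID check is an assumption added for illustration.

```python
import json


def merge_entries(stored, new_entries):
    """Hypothetical helper: append entries whose ID is not already on file."""
    known_ids = {entry["ID"] for entry in stored["entries"]}
    for entry in new_entries:
        if entry["ID"] in known_ids:
            continue  # skip duplicates instead of overwriting
        stored["entries"].append(entry)
        known_ids.add(entry["ID"])
    return stored


stored = {
    "created_by": "ASE",
    "entries": [{"ID": 0, "search": "ÍNDICE:", "replace": "CONTENIDO:"}],
}
new = [
    {"ID": 999, "search": r"\d+%", "replace": None},
    {"ID": 0, "search": "x", "replace": "y"},  # same ID as an entry on file
]
merged = merge_entries(stored, new)

# ensure_ascii=False keeps accented characters readable in the output
print(json.dumps(merged, indent=4, ensure_ascii=False))
```

Checking IDs before appending is one way to keep a rule from being added twice when the cell is run repeatedly.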

In [9]:
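import os
import json

# Load the existing substitution item from file
fp = os.path.join("out", "wmt16_en-es.json")
new_subs = PreprocSub(fp=fp)

# Add the new entries and preview the combined item as JSON
for entry in subs_list:
    new_subs.entries.append(entry)
print(json.dumps(new_subs.convert_to_json(), indent=4, ensure_ascii=False))

# Uncomment to apply the combined item or to write it back to file
#new_subs.apply_to_working_files(fps)
#new_subs.convert_to_json(fp)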

{
    "__class__": "PreprocSub",
    "__module__": "source.subs",
    "version": 0.1,
    "created_by": "ASE",
    "desc": "For WMT 16 testset (EN-ES)",
    "ped_effect": 0.0010021806317097592,
    "entries": [
        {
            "__class__": "SearchSourceEntry",
            "__module__": "source.entries",
            "ID": 0,
            "search": "ANTECEDENTES(?= Y OBJETIVOS)",
            "replace": "JUSTIFICATIVA",
            "created_by": null,
            "ped_effect": 0.0009035115795690829,
            "desc": "Term: BACKGROUND: JUSTIFICATIVA",
            "t_lid": "ES",
            "s_lid": "EN",
            "source_filter": "^BACKGROUND AND OBJECTIVES:"
        },
        {
            "__class__": "SearchSourceEntry",
            "__module__": "source.entries",
            "ID": 1,
            "search": "ÍNDICE:",
            "replace": "CONTENIDO:",
            "created_by": null,
            "ped_effect": 7.659633047024661e-05,
            "desc": "Term: CONTENTS: CONTE

## 5. Conclusion
In this notebook, we have explored options to navigate through XLIFF files using interactive controls. We have introduced a new substitution entry to pull tags into the target segment. We then applied our entries to a number of pre-translated files to further improve the output.

This concludes this notebook. If you found any of this content helpful or confusing, please let me know. [mailto](mailto:arnseelig[at]gmail.com)