# PED Writer Notebook
In this notebook, we use our insights about error patterns in post-edited data to further improve MT output. Check out the [PED Reader notebook](https://github.com/SeeligA/ped_reader/blob/master/ped_reader_nb.ipynb "PED Reader Notebook") to see how we derived these insights. Since most translation jobs are based on, or are at least compatible with, the XLIFF exchange format, we will apply our rules to XLIFF files.

**Prerequisites:**
* Previous knowledge of *Python* is not strictly required, but certainly helpful.
* If you know your way around the *command line* and different flavours of *Regular Expressions*, you should be fine.


## Notes on XLIFF files
If you are reading this, you are probably a localization engineer and better trained in the use of *XML* and *XLIFF* than I am. If you haven't worked with XLIFF files before, here are some important things to know:
* It is an XML-based standard developed by the OASIS group. Version 2.x has since been published, but for our purposes [1.2](http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html "XLIFF 1.2 Specification") will do.
* The header stores information about the source document(s) as well as definitions for formatting and non-translatable tokens. Some implementations also embed the source document to support live previews.
* The body stores so-called **translation units** (TUs) which are sentence-length or paragraph-length tokens of the original text. TUs are usually bilingual, containing a source and a target element (called segments) as well as definitions with various useful metadata about the segments themselves.
* Last but not least, segments often contain **inline elements** to represent formatting or non-translatables. These elements can be a bit tricky for MT, because their position often depends on linguistic context.
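
To make the structure described above more concrete, here is a minimal sketch that parses a hand-written XLIFF 1.2 snippet. The snippet is hypothetical and not taken from the sample file. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; `lxml.etree`, which this notebook imports, offers a largely compatible API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XLIFF 1.2 document (not from the sample file)
XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="es" datatype="plaintext">
    <header/>
    <body>
      <trans-unit id="1">
        <source>Click <g id="1">Save</g> to continue.</source>
        <target>Haga clic en <g id="1">Guardar</g> para continuar.</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

# XLIFF 1.2 elements live in this namespace, so every lookup needs it
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

root = ET.fromstring(XLIFF)
summary = []
for tu in root.findall(".//x:trans-unit", NS):
    source = tu.find("x:source", NS)
    target = tu.find("x:target", NS)
    # IDs of the inline <g> elements found directly inside the source segment
    inline_ids = [g.get("id") for g in source.iterfind("x:g", NS)]
    summary.append(
        (tu.get("id"), "".join(source.itertext()), "".join(target.itertext()), inline_ids)
    )

print(summary)
```

Note how the inline `<g>` element sits in a different position in the source and the target, which is exactly what makes inline elements tricky for MT.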

To extract, process and write XML data, we import a few new functions in addition to the Entry and Substitution modules introduced earlier.

In [294]:
# The following five lines are for the creator of this notebook. Please ignore.
#%load_ext autoreload
#%autoreload 2
#import logging
#logger = logging.getLogger()
#logger.setLevel(logging.INFO)

import os
from lxml import etree as ET

from source.xliff import create_tree, print_sample_from_file, print_sample
from source.subs import PreprocSub
from source.entries import SearchMTEntry, SearchSourceEntry, ToggleCaseEntry, ApplyTagEntry
from source.utils import unzip_sample



## Reading and pretty printing
Next, let's see what kind of data we are dealing with. Running the next cell will read the sample file and print out a preview of a single translation unit.
```
tree, tus = create_tree(fp)
```
Note that the line above creates two new objects to help us parse string data:
- a Tree object representing and indexing all elements in the XML file.
- a list of translation unit nodes from that Tree, which we will iterate over during processing.
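
As a rough illustration of this pattern, the sketch below returns a tree together with a list of its translation units. `create_tree_sketch` is a hypothetical stand-in written for this example, not the actual implementation of `create_tree`, and it uses the standard library parser and an in-memory document instead of `lxml` and a file on disk.

```python
import io
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def create_tree_sketch(file_or_path):
    """Hypothetical stand-in: parse a file and return (tree, list of trans-unit nodes)."""
    tree = ET.parse(file_or_path)
    tus = tree.getroot().findall(f".//{{{XLIFF_NS}}}trans-unit")
    return tree, tus


# Hypothetical in-memory document standing in for a file on disk
demo = io.BytesIO(
    b"""<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="a"><source>One</source><target>Uno</target></trans-unit>
      <trans-unit id="b"><source>Two</source><target>Dos</target></trans-unit>
    </body>
  </file>
</xliff>"""
)

tree, tus = create_tree_sketch(demo)
ids = [tu.get("id") for tu in tus]

# The list holds references to nodes in the tree, not copies:
# a change made through the list is visible when the tree is queried again.
tus[1].set("approved", "yes")
found = tree.getroot().find(f".//{{{XLIFF_NS}}}trans-unit[@id='b']")
print(ids, found.get("approved"))
```

Because the list entries are references into the tree, edits made while iterating over the translation units end up in the tree that is eventually written back to disk.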

In [296]:
dir_name = "in"
fps = unzip_sample(dir_name)

# Create a tree object and a list of translation units
tree, tus = create_tree(fps[0])
# Change this value to print a different translation unit
tu_id = 7
print_sample(tus, tu_id)

Sample extracted here: in
('<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" '
 'id="f049fffc-3cb2-4869-ba10-f76b02e3593c">\n'
 '  <source>The collection represents <g id="227">100%</g> of the families <g id="233">and</g> subfamilies, <g '
 'id="239">and</g> <g id="245">55%</g> of the genera reported for Colombia.</source>\n'
 '  <seg-source>\n'
 '    <mrk mtype="seg" mid="8">The collection represents <g id="227">100%</g> of the families <g id="233">and</g> '
 'subfamilies, <g id="239">and</g> <g id="245">55%</g> of the genera reported for Colombia.</mrk>\n'
 '  </seg-source>\n'
 '  <target>\n'
 '    <mrk mtype="seg" mid="8">Esta colección representa el 100% de las familias y subfamilias, y el 55% de los '
 'géneros registrados para Colombia.</mrk>\n'
 '  </target>\n'
 '  <sdl:seg-defs>\n'
 '    <sdl:seg id="8" conf="Draft" origin="mt" origin-system="Alignment11">\n'
 '      <sdl:value key="SDL:OriginalTranslationHash"/>\n'
 ' 

## Modifying string data and inserting elements
As you can see above, our string data is embedded in a hierarchical structure of elements. Instead of searching and replacing strings directly from a table, we will need to parse text from the `<seg-source>MyString</seg-source>` and `<target>MeinString</target>` elements first. There is an upside, too: it gives us flexibility to modify the string environment and how text is displayed in our CAT editor and the target document.
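
The following sketch shows why segment text has to be parsed rather than read off directly: once a segment contains inline elements, its text is split across several nodes. The `<mrk>` fragment is shortened from the sample above and stripped of its namespace for brevity, and the standard library parser is used in place of `lxml`.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free fragment modelled on the sample segment
mrk = ET.fromstring(
    '<mrk mtype="seg" mid="8">The collection represents '
    '<g id="227">100%</g> of the families.</mrk>'
)

head = mrk.text          # text before the first inline element
inline = mrk[0]          # the <g> element
inner = inline.text      # text wrapped by the inline element
trailing = inline.tail   # text following the inline element

# itertext() walks all text nodes in document order
plain = "".join(mrk.itertext())

print(repr(head), repr(inner), repr(trailing))
print(plain)
```

Reading only `mrk.text` would silently drop everything after the first tag, so any search-and-replace logic needs to account for the `text` and `tail` of each inline element.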

To illustrate this, I am introducing a new Entry object, which looks up inline formatting tags or placeholders in the source and applies the same formatting to the corresponding match in the target. This new entry **only works with XLIFF files, it does not work with the table data** from the previous notebook. This is because the table data does not include tags, formatting or otherwise.
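
The idea of copying tags from source to target can be sketched with plain regular expressions on serialized segments. This is only an illustration under stated assumptions: `apply_tags_sketch` is a hypothetical helper, not the implementation of `ApplyTagEntry`, and it pairs source tags with target matches purely by order of appearance.

```python
import re


def apply_tags_sketch(source, target, tagged_pattern, search_pattern):
    """Hypothetical helper: wrap untagged target matches in tags found in the source.

    Tags are paired with matches by order of appearance (an assumption).
    """
    tags = iter(m.group(0) for m in re.finditer(tagged_pattern, source))

    def wrap(match):
        template = next(tags, None)
        if template is None:
            # No source tag left: keep the target text unchanged
            return match.group(0)
        # Swap the text inside the source tag for the matched target text
        return re.sub(r">[^<]*<", lambda _: ">" + match.group(0) + "<", template, count=1)

    return re.sub(search_pattern, wrap, target)


source = ('The collection represents <g id="227">100%</g> of the families '
          'and <g id="245">55%</g> of the genera.')
target = "Esta colección representa el 100% de las familias y el 55% de los géneros."

result = apply_tags_sketch(
    source,
    target,
    tagged_pattern=r"<(\w+) [^>]+?>\d+%</\1>",   # a percentage wrapped in a tag
    search_pattern=r"(?<=[^>])\d+%(?=[^<]|$)",   # a percentage not adjacent to a tag
)
print(result)
```

Order-based pairing breaks down when the target reorders or omits items, which is one reason rules like these should be checked against samples before the files are overwritten.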

Back to our sample file: it appears that some formatting tags are missing in the target. Let's fix this by wrapping some custom rules in a substitution object and applying it to the sample files in our directory.

In [297]:
# Raw strings (r"...") keep regex escapes such as \b and \d from being interpreted by Python
x = ApplyTagEntry(997, "ASE", None, "Format: color green 'percentage'", "All", "All", r"(?<=[^>])\d+\%(?=([^<]|$))", None, r'(<(\w+) [^>]+?>\d+%</\w+>)')
y = ApplyTagEntry(998, "ASE", None, "Format: italics 'and'", "EN", "ES", r"(?<=[^>])\by\b(?=([^<]|$))", "y", r'(<(\w+) [^>]+?>and</\w+>)')
z = ApplyTagEntry(999, "ASE", None, "Format: color red 'nums'", "All", "All", r"(?<=[^>])\d+(?=([^<]|$))", None, r'(<(\w+) [^>]+?>\d+?</\w+>)')

subs_list = [z, y, x]
subs = PreprocSub(created_by="ASE", desc="Testing", entries=subs_list)

# The boolean `write` flag controls whether the sample files get overwritten
cache = subs.apply_to_working_files(dir_name, write=False)
print_sample(cache[fps[0]], tu_id)

Backup containing 1 files created here: in
('<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" '
 'id="f049fffc-3cb2-4869-ba10-f76b02e3593c">\n'
 '  <source>The collection represents <g id="227">100%</g> of the families <g id="233">and</g> subfamilies, <g '
 'id="239">and</g> <g id="245">55%</g> of the genera reported for Colombia.</source>\n'
 '  <seg-source>\n'
 '    <mrk mtype="seg" mid="8">The collection represents <g id="227">100%</g> of the families <g id="233">and</g> '
 'subfamilies, <g id="239">and</g> <g id="245">55%</g> of the genera reported for Colombia.</mrk>\n'
 '  </seg-source>\n'
 '  <target>\n'
 '    <mrk mtype="seg" mid="8">Esta colección representa el <g id="227">100%</g> de las familias <g id="233">y</g> '
 'subfamilias, <g id="239">y</g> el <g id="245">55%</g> de los géneros registrados para Colombia.</mrk>\n'
 '  </target>\n'
 '  <sdl:seg-defs>\n'
 '    <sdl:seg id="8" conf="Draft" origin="mt" origin-syste

## Next Steps
Once you have determined that all your rules work as expected, you can add them to any existing substitution items you might have on file.

In [305]:
fp = os.path.join("out", "wmt16_en-es.json")
new_subs = PreprocSub(fp=fp)

for entry in subs_list:
    new_subs.entries.append(entry)
new_subs.convert_to_json()

{'__class__': 'PreprocSub',
 '__module__': 'source.subs',
 'version': 0.1,
 'created_by': 'ASE',
 'desc': 'For WMT 16 testset (EN-ES)',
 'ped_effect': -3.0750116881750333e-05,
 'entries': [{'__class__': 'SearchSourceEntry',
   '__module__': 'source.entries',
   'ID': 0,
   'created_by': 'ASE',
   'ped_effect': 0.0,
   'desc': 'Term: Methodology: methodología',
   't_lid': 'EN',
   's_lid': 'ES',
   'search': 'Materiales y métodos:',
   'replace': 'Metodología',
   'condition': 'Source+MT',
   'source': '^Methodology:'},
  {'__class__': 'SearchSourceEntry',
   '__module__': 'source.entries',
   'ID': 1,
   'created_by': 'ASE',
   'ped_effect': 0.0,
   'desc': 'Term: rupture: ruptura',
   't_lid': 'EN',
   's_lid': 'ES',
   'search': '\\bperforación(?:e)?(s)?',
   'replace': 'ruptura\\1',
   'condition': 'Source+MT',
   'source': '\\b[Rr]upture'},
  {'__class__': 'SearchSourceEntry',
   '__module__': 'source.entries',
   'ID': 2,
   'created_by': 'ASE',
   'ped_effect': 0.0,
   'desc': '