# PED Writer Notebook
In this notebook, we use our insights about error patterns in post-edited data to further improve MT output. Check out the [PED Reader notebook](https://github.com/SeeligA/ped_reader/blob/master/ped_reader_nb.ipynb "PED Reader Notebook") to see how we derived these insights. Since most translation jobs are based on, or at least compatible with, the XLIFF exchange format, we will apply our rules to these files.

If you haven't worked with XLIFF files before, here are some important things to know:
* XLIFF is an XML-based standard developed by OASIS. Newer 2.x versions exist, but for our purposes [1.2](http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html "XLIFF 1.2 Specification") will do.
* It has a header that stores information about the source document(s), sometimes the source document(s) themselves, and definitions for formatting and non-translatable tokens.
* The body stores so-called translation units (TUs), which are sentence-length or paragraph-length tokens of the original text. TUs are normally bilingual, containing a source and a target element (called segments) as well as definitions with various useful metadata about the segments themselves.
* Last but not least, segments often contain inline elements to represent formatting or non-translatables. These elements can be a bit tricky for MT, because their position often depends on linguistic context.
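
To make the structure described above more concrete, here is a minimal sketch that parses a tiny, hand-written XLIFF 1.2 document. The sample content (`menu.txt`, the note text, and the sentence pair) is made up for illustration. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; the rest of this notebook uses `lxml`, whose `etree` module offers a largely compatible API.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents, mapped to a short prefix for lookups.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical miniature XLIFF file: one header, one body, one translation unit.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="menu.txt" source-language="de-DE" target-language="en-US" datatype="plaintext">
    <header>
      <note>Hypothetical header content</note>
    </header>
    <body>
      <trans-unit id="1">
        <source>Bitte <g id="1">hier</g> klicken.</source>
        <target>Please click <g id="1">here</g>.</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def plain_text(element):
    """Return the text of an element with all inline tags stripped."""
    return "".join(element.itertext())


def inline_tags(element):
    """Return the local names of the inline elements directly inside a segment."""
    return [child.tag.split("}")[1] for child in element]


root = ET.fromstring(SAMPLE)
file_el = root.find("x:file", NS)

segments = []
for tu in file_el.findall("x:body/x:trans-unit", NS):
    source = tu.find("x:source", NS)
    target = tu.find("x:target", NS)
    segments.append({
        "id": tu.get("id"),
        "source": plain_text(source),
        "target": plain_text(target),
        "inline": inline_tags(source),
    })

print(file_el.get("source-language"), "->", file_el.get("target-language"))
print(segments)
```

Note how the `<g>` element wraps a different position in the source and in the target. This is the context-dependent placement that makes inline elements tricky for MT.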

To parse XML data, we need the `lxml` library in addition to the convenience functions introduced earlier.

In [6]:
# The following two lines are for the creator of this notebook. Please ignore.
#%load_ext autoreload
#%autoreload 2

import os
from lxml import etree as ET
import pprint

from source.xliff import create_tree
from source.subs import PreprocSub
from source.entries import SearchMTEntry, SearchSourceEntry, ToggleCaseEntry


In [7]:
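def print_sample(fp, tu_id):
    """Pretty-print the translation unit selected by `tu_id` from the XLIFF file at `fp`."""
    # Only the translation units are needed here, so the tree itself is ignored.
    _tree, tus = create_tree(fp)
    sample = ET.tostring(tus[tu_id], encoding='utf-8', pretty_print=True).decode('utf-8')
    # PrettyPrinter breaks the long XML string into quoted chunks of at most 120 characters.
    pp = pprint.PrettyPrinter(indent=4, width=120)
    pp.pprint(sample)
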
filename = "Wochekarte der Bäckerei Staib im Höhenblick.txt.sdlxliff"
directory = "data"
fp = os.path.join(directory, filename)
tu_id = 10

print_sample(fp, tu_id)

('<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" '
 'id="aa63d552-6a04-4b9f-9477-5edb233f4ea4">\n'
 '  <source>WIR FREUEN UNS auf Ihren Besuch.</source>\n'
 '  <seg-source>\n'
 '    <mrk mtype="seg" mid="11">WIR FREUEN UNS auf Ihren Besuch.</mrk>\n'
 '  </seg-source>\n'
 '  <target>\n'
 '    <mrk mtype="seg" mid="11">We look forward to your visit.</mrk>\n'
 '  </target>\n'
 '  <sdl:seg-defs>\n'
 '    <sdl:seg id="11" conf="Draft" origin="mt" origin-system="DeepL Translator provider using DeepL Translator ">\n'
 '      <sdl:value key="SegmentIdentityHash">btLWNvEALVWgAnolRNWT8q5F5k0=</sdl:value>\n'
 '    </sdl:seg>\n'
 '  </sdl:seg-defs>\n'
 '</trans-unit>\n')
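
The sample above shows how an SDLXLIFF translation unit links its segments to their metadata: the `mid` attribute of each `<mrk>` element matches the `id` of an `<sdl:seg>` entry. The following sketch reads that link back out of the unit. It is an illustration using the standard library's `xml.etree.ElementTree`, not the code behind `create_tree`, and the helper name `read_segments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces as declared on the <trans-unit> element in the sample output.
NS = {
    "x": "urn:oasis:names:tc:xliff:document:1.2",
    "sdl": "http://sdl.com/FileTypes/SdlXliff/1.0",
}

# The translation unit printed above, reproduced as a plain string.
TU_XML = """<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" id="aa63d552-6a04-4b9f-9477-5edb233f4ea4">
  <source>WIR FREUEN UNS auf Ihren Besuch.</source>
  <seg-source>
    <mrk mtype="seg" mid="11">WIR FREUEN UNS auf Ihren Besuch.</mrk>
  </seg-source>
  <target>
    <mrk mtype="seg" mid="11">We look forward to your visit.</mrk>
  </target>
  <sdl:seg-defs>
    <sdl:seg id="11" conf="Draft" origin="mt" origin-system="DeepL Translator provider using DeepL Translator ">
      <sdl:value key="SegmentIdentityHash">btLWNvEALVWgAnolRNWT8q5F5k0=</sdl:value>
    </sdl:seg>
  </sdl:seg-defs>
</trans-unit>"""


def read_segments(tu):
    """Join source text, target text and SDL metadata of a unit by segment id."""
    def by_mid(path):
        # Map each segment marker's mid to its text, ignoring inline tags.
        return {m.get("mid"): "".join(m.itertext()) for m in tu.findall(path, NS)}

    sources = by_mid("x:seg-source/x:mrk[@mtype='seg']")
    targets = by_mid("x:target/x:mrk[@mtype='seg']")

    rows = []
    for seg in tu.findall("sdl:seg-defs/sdl:seg", NS):
        seg_id = seg.get("id")
        rows.append({
            "id": seg_id,
            "source": sources.get(seg_id),
            "target": targets.get(seg_id),
            "conf": seg.get("conf"),
            "origin": seg.get("origin"),
        })
    return rows


rows = read_segments(ET.fromstring(TU_XML))
print(rows)
```

The `origin` attribute is what lets us tell machine-translated segments (`"mt"` in this sample) apart from other content.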


In [8]:
a = ToggleCaseEntry(999, "ASE", None, "Testing", "All", "ES", 10, 'upper', None)
subs_list = [a]
subs = PreprocSub(created_by="ASE", desc="Testing", entries=subs_list)

# write=True modifies the files in `directory`; the output below reports that a backup is created.
subs.apply_to_working_files(directory, write=True)

print_sample(fp, tu_id)

Backup containing 1 files created here: data
('<trans-unit xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" '
 'id="aa63d552-6a04-4b9f-9477-5edb233f4ea4">\n'
 '  <source>WIR FREUEN UNS auf Ihren Besuch.</source>\n'
 '  <seg-source>\n'
 '    <mrk mtype="seg" mid="11">WIR FREUEN UNS auf Ihren Besuch.</mrk>\n'
 '  </seg-source>\n'
 '  <target>\n'
 '    <mrk mtype="seg" mid="11">WE LOOK FOrward to your visit.</mrk>\n'
 '  </target>\n'
 '  <sdl:seg-defs>\n'
 '    <sdl:seg id="11" conf="Draft" origin="mt" origin-system="DeepL Translator provider using DeepL Translator ">\n'
 '      <sdl:value key="SegmentIdentityHash">btLWNvEALVWgAnolRNWT8q5F5k0=</sdl:value>\n'
 '    </sdl:seg>\n'
 '  </sdl:seg-defs>\n'
 '</trans-unit>\n')
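
Judging from the output above, the entry appears to have upper-cased the first 10 characters of the target segment, which matches the `10` and `'upper'` arguments passed to `ToggleCaseEntry`. The sketch below reproduces that effect on a plain string. It is an inference from this one sample, not the actual `ToggleCaseEntry` implementation, and the function name `toggle_case_prefix` is hypothetical.

```python
def toggle_case_prefix(text, length, mode="upper"):
    """Change the case of the first `length` characters and keep the rest as-is."""
    if mode not in ("upper", "lower"):
        raise ValueError("mode must be 'upper' or 'lower'")
    head, tail = text[:length], text[length:]
    converted = head.upper() if mode == "upper" else head.lower()
    return converted + tail


result = toggle_case_prefix("We look forward to your visit.", 10, "upper")
print(result)
```

With these inputs the result matches the target segment shown above. A fixed length of 10 cuts "forward" in the middle of the word, whereas the upper-case run in the source ("WIR FREUEN UNS") ends at a word boundary.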
