# PED Reader Notebook

This notebook supports the analysis of MT and MTPE projects. It provides the relevant Python code along with notes that give new users additional context. The notebook takes JSON data generated by the [Woerdle-zehla](https://github.com/SeeligA/woerdle-zehla "Measuring Post-Edit density for Studio files and Across LS exports") app as input and produces visualizations and post-processed tables as output.

**Prerequisites:**
* Previous knowledge of *Python* is not strictly required, but certainly helpful.
* If you know your way around the *command line* and different flavours of *Regular Expressions*, you should be fine.

First, let's import the necessary scripts and objects from our project directory.

In [1]:
# The following 5 lines are for the creator of this notebook. Please ignore.
#%load_ext autoreload
#%autoreload 2
#import logging
#logger = logging.getLogger()
#logger.setLevel(logging.INFO)

from source.table import create_df, build_query, filter_items, save_to_excel
from source.calculation import pe_density
from source.utils import plot, obj_to_dict, dict_to_obj
from source.subs import PreprocSub
from source.entries import SearchMTEntry, SearchSourceEntry, ToggleCaseEntry
from source.controls import MyFilterWidget

from ipywidgets import interact

## 1. Creating a dataset
First, we will create a table by iterating over the JSON files contained in the `data` directory. The table columns will be populated with items at the root level ("Relation", "Project", "Document", "s_lid", "t_lid") as well as the segment scores and string data at the `ped_details` level. We can then use any of the root-level items to create subsets using a custom query expression.
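
To make the flattening step concrete, here is a minimal sketch using only the standard library. The JSON layout shown (root-level keys plus a `ped_details` list whose items carry `score`, `source`, `target` and `mt`) is an assumption inferred from the column names in this notebook; the real export and the behaviour of `create_df` may differ.

```python
import json

# Hypothetical export, shaped after the column names used in this notebook.
raw = """
{
  "Relation": "WMT_16_ped",
  "Project": "test-gma-en2es-health",
  "Document": "S0021.csv.sdlxliff",
  "s_lid": "EN",
  "t_lid": "ES",
  "ped_details": [
    {"score": 0.25, "source": "A", "target": "B", "mt": "C"},
    {"score": 0.50, "source": "D", "target": "E", "mt": "F"}
  ]
}
"""

ROOT_KEYS = ("Relation", "Project", "Document", "s_lid", "t_lid")


def flatten(record):
    """Return one row per segment, repeating the root-level items on each row."""
    root = {key: record[key] for key in ROOT_KEYS}
    return [{**root, **segment} for segment in record["ped_details"]]


rows = flatten(json.loads(raw))
for row in rows:
    print(row["Document"], row["score"], row["mt"])
```

Repeating the root-level items on every row is what later makes it possible to filter segments by relation, project or language pair.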

In [2]:
# Create table from file in data subfolder
data_path = input('Path to JSON data: ')
df = create_df(data_path)
print(df.describe())
# Show the last 5 rows of the table
df.tail()

Path to JSON data: data
              score
count  11994.000000
mean   0.372192    
std    0.176878    
min    0.000000    
25%    0.244898    
50%    0.353403    
75%    0.481781    
max    1.000000    


Unnamed: 0,Project,Relation,Document,s_lid,t_lid,score,source,target,mt
11989,test-gma-en2es-health,WMT_16_ped,S0021.csv.sdlxliff,EN,ES,0.478528,"Of the remaining 122 infants, 61 were kept at the intensive care units of public hospitals, and 61 were transferred to a private unit.","De los 122 niños incluidos, 61 quedaron en la unidad del hospital público en que nacieron y 61 fueron referidos a unidades privadas.","De los 122 lactantes restantes, 61 se mantuvieron en las unidades de cuidados intensivos de los hospitales públicos, y 61 fueron transferidos a una unidad privada."
11990,test-gma-en2es-health,WMT_16_ped,S0021.csv.sdlxliff,EN,ES,0.528455,The infants who were transferred presented lower gestational age and increased neonatal depression.,Los atendidos en el sector privado resultaron más prematuros y con mayor frecuencia de depresión neonatal.,Los recién nacidos que fueron transferidos presentaron una edad gestacional más baja y un aumento de la depresión neonatal.
11991,test-gma-en2es-health,WMT_16_ped,S0021.csv.sdlxliff,EN,ES,0.55287,"However, mortality among infants treated at intensive care units of public hospitals was twice as high (Hazard Ratio 1.8; 95%CI 1.1- 3.4; P=0.04), especially in infants who weighed less than 1,000g (Hazard Ratio 2.4; 95%CI 1.1-5.5; P=0.04).","Sin embargo, la mortalidad en el sector público fue casi dos veces mayor (Hazard Ratio 1.8, IC 95% 1.1-3.4, p=0,04), fundamentalmente en los menores de 1000 gramos (Hazard Ratio 2.4, IC 95% 1.1-5.5, p=0,04).","Sin embargo, la mortalidad entre los recién nacidos tratados en las unidades de cuidados intensivos de los hospitales públicos fue dos veces mayor (cociente de riesgos 1,8; IC del 95%: 1,1 a 3,4; p = 0,04), especialmente en los recién nacidos que pesaban menos de 1.000 g (cociente de riesgos 2,4; IC del 95%: 1,1 a 5,5; p = 0,04)."
11992,test-gma-en2es-health,WMT_16_ped,S0021.csv.sdlxliff,EN,ES,0.397906,"CONCLUSIONS: the health status of very low birth weight infants treated at intensive care units of public and private hospitals in Montevideo, Uruguay, was assessed.","CONCLUSIÓN: se realizó una evaluación de la atención de los niños de muy bajo peso atendidos en unidades intensivas públicas y privadas de Montevideo, Uruguay.","CONCLUSIONES: se evaluó el estado de salud de los niños de muy bajo peso al nacer tratados en las unidades de cuidados intensivos de los hospitales públicos y privados de Montevideo, Uruguay."
11993,test-gma-en2es-health,WMT_16_ped,S0021.csv.sdlxliff,EN,ES,0.680851,"Mortality was lower, and health care was better in neonatal units of private hospitals.",Hubo menor mortalidad en los niños atendidos en el sector privado y algunas evidencias de que la calidad de atención es mejor en éste sector.,La mortalidad fue menor y la atención de salud fue mejor en las unidades neonatales de los hospitales privados.


In order to make our view of the data more insightful, we should create a query and filter for certain attributes or items: 
* Run the cell below to open an interactive widget with different filter options. 
* By selecting and deselecting items you create a custom query string.
* To reset your filter settings, run the cell again.

- - -
#### Notes
* Use Ctrl or Shift to select multiple filter values. 
* The ```score``` parameter specifies a PED range. The default range is between zero and one (inclusive).
* If you prefer using your own Pandas query, switch to the Python tab and paste it in there.
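
As an illustration of what such a query string can look like, the sketch below assembles one from a dictionary of selections. `make_query` is a hypothetical helper written for this note, not the project's `build_query`; the resulting string is the kind of expression that `DataFrame.query` is expected to accept.

```python
def make_query(selections, score_range=(0.0, 1.0)):
    """Combine selected values per column and a PED range into one expression."""
    parts = [
        "{} in {!r}".format(column, sorted(values))
        for column, values in selections.items()
        if values  # skip columns where nothing is selected
    ]
    low, high = score_range
    parts.append("score >= {} and score <= {}".format(low, high))
    return " and ".join(parts)


query = make_query({"Relation": ["WMT_16_ped"], "t_lid": ["ES"], "Project": []},
                   score_range=(0.2, 0.6))
print(query)
```

Columns without a selection are simply left out, so an empty selection does not restrict the result.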

In [3]:
w = MyFilterWidget(df)
display(w, w.out)

MyFilterWidget(children=(Accordion(children=(SelectMultiple(description='Relation', index=(0,), layout=Layout(…

Output()

Now it's time to apply the query to your data table.

Note that if your query has no matches, you will receive an **AssertionError**.

In [4]:
# Note that if your query has no matches, you will receive an AssertionError
data = w.run_query()
# Calculate PED scores at a dataset-level and at a segment level and
# write the latter to a new column called "virtual". 
# At this stage, the new column should be identical to the "score" column.
ped, data = pe_density(data)
print('The aggregated PED for this dataset is {:f}'.format(ped))
data.head()

The aggregated PED for this dataset is 0.384075


Unnamed: 0,Project,Relation,Document,s_lid,t_lid,score,source,target,mt,virtual,max_char,lev
3198,test-gma-en2es-health,WMT_16_ped,S0034.csv.sdlxliff,EN,ES,0.450644,BACKGROUND AND OBJECTIVES: Some studies have reported improved quality of peribulbar block by adding hyaluronidase to the local anesthetic solution while others claimed no beneficial effect.,"JUSTIFICATIVA Y OBJETIVOS: Algunos estudios han relatado mejoria de la calidad del bloqueo peribulbar con el uso de hialuronidasa, en cuanto otros han concluido por la ausencia del efecto.","ANTECEDENTES Y OBJETIVOS: Algunos estudios han informado una mejoría en la calidad del bloqueo peribulbar mediante la adición de hialuronidasa a la solución anestésica local, mientras que otros no afirmaron ningún efecto beneficioso.",0.450644,233,105.0
3199,test-gma-en2es-health,WMT_16_ped,S0034.csv.sdlxliff,EN,ES,0.011765,This study aimed at investigating the influence of hyaluronidase on intraocular pressure (IOP) and the quality of peribulbar block with 1% ropivacaine.,El objetivo de este estudio fue investigar la influencia de la hialuronidasa sobre la presión intra-ocular (PIO) y la calidad del bloqueo peribulbar con ropivacaína a 1%.,El objetivo de este estudio fue investigar la influencia de la hialuronidasa sobre la presión intraocular (PIO) y la calidad del bloqueo peribulbar con ropivacaína al 1%.,0.011765,170,2.0
3200,test-gma-en2es-health,WMT_16_ped,S0034.csv.sdlxliff,EN,ES,0.364706,"CONCLUSIONS: When 1% ropivacaine supplemented with 50 IU.ml-1 hyaluronidase is used in peribulbar block, IOP values are lower and blockade quality is significantly better than when 1% plain ropivacaine is used.","CONCLUSIONES: Cuando se usa solución de ropivacaína a 1% adicionada de hialuronidasa 50 UI.ml-1 en bloqueo peribulbar, los valores de la PIO son menores y la calidad del bloqueo es mejor de que cuando se utiliza ropivacaína a 1% sin hialuronidasa.","CONCLUSIONES: Cuando se utiliza ropivacaína al 1% complementada con 50 IU.ml-1 hialuronidasa en el bloqueo peribulbar, los valores de la PIO son más bajos y la calidad del bloqueo es significativamente mejor que cuando se utiliza ropivacaína simple al 1%.",0.364706,255,93.0
3201,test-gma-en2es-health,WMT_16_ped,S0034.csv.sdlxliff,EN,ES,0.245902,Its action mechanism is broadly discussed based on results of experimental studies and clinical evidences.,"Su mecanismo de acción es bastante discutido, con base en resultados experimentales y en evidencias clínicas.",Su mecanismo de acción es ampliamente discutido en base a los resultados de estudios experimentales y evidencias clínicas.,0.245902,122,30.0
3202,test-gma-en2es-health,WMT_16_ped,S0034.csv.sdlxliff,EN,ES,0.210526,"In vitro and in vivo preparations followed the techniques of Bulbring and Leeuwin and Wolters, respectively.","Las preparaciones in vitro e in vivo fueron montadas de acuerdo con las técnicas de Bulbring y de Leeuwin y Wolters, respectivamente.","Las preparaciones in vitro e in vivo siguieron las técnicas de Bulbring y Leeuwin y Wolters, respectivamente.",0.210526,133,28.0
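
The `lev` and `max_char` columns suggest how the segment score comes about: in the rows above, `score` equals `lev / max_char` (for example 105 / 233 ≈ 0.450644). The sketch below reproduces that ratio with a plain character-level Levenshtein distance. Treating `max_char` as the length of the longer of the MT and target strings, and aggregating by summing over segments, are assumptions; `pe_density` may handle either differently.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def segment_ped(mt, target):
    """Edits per character, relative to the longer string (assumed definition)."""
    max_char = max(len(mt), len(target))
    return levenshtein(mt, target) / max_char if max_char else 0.0


def dataset_ped(pairs):
    """Assumed aggregation: total edits divided by total characters."""
    total_lev = sum(levenshtein(mt, target) for mt, target in pairs)
    total_char = sum(max(len(mt), len(target)) for mt, target in pairs)
    return total_lev / total_char if total_char else 0.0


pairs = [("presión intraocular", "presión intra-ocular"),
         ("ropivacaína al 1%", "ropivacaína a 1%")]
for mt, target in pairs:
    print(round(segment_ped(mt, target), 4), mt, "->", target)
print(round(dataset_ped(pairs), 4))
```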


## 2. Plotting post-edit density data
Now that we have prepared the data we're interested in, we can plot the distribution of post-edit density scores. Depending on our analysis question, we might be interested in the total number of segments per bin or a normalized view of each bin. The KDE flag allows us to control the scaling. KDE stands for [Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation "Kernel Density Estimation on Wikipedia"), a method that smooths the observed scores into a continuous density curve.

<img src="out\WMT16_sample.png" alt="Sample plot" width="400" align="right"/>


```
plot(data, cat_column="t_lid", 
     kde=True, 
     ped=ped, 
     save="out\\sample.png",
     linewidth=4 
     )
```
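
To see what the two views compute, the following sketch derives raw counts per bin, the normalized density per bin, and a Gaussian kernel density estimate from a handful of made-up scores. It is a standard-library illustration only: the bin count, bandwidth and sample scores are arbitrary assumptions, and the project's `plot` function may rely on a plotting library with different defaults.

```python
import math

scores = [0.05, 0.15, 0.15, 0.45, 0.95, 1.0]  # made-up PED scores
n_bins = 10
bin_width = 1.0 / n_bins

# Raw view: number of segments per bin (a score of 1.0 falls into the last bin).
counts = [0] * n_bins
for score in scores:
    counts[min(int(score * n_bins), n_bins - 1)] += 1

# Normalized view: bar areas sum to 1, so datasets of different size compare.
density = [count / (len(scores) * bin_width) for count in counts]


def gaussian_kde(x, sample, bandwidth=0.1):
    """Average of Gaussian bumps centred on each observation."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm


print(counts)
print([round(gaussian_kde(x / 10, scores), 2) for x in range(11)])
```

A smaller bandwidth follows the individual scores more closely, while a larger one gives a smoother curve.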


Try answering these questions:
* How does the distribution change as we specify different categorical values?
* Are there any features that seem to correlate with the post-edit density distribution?
* Do you think your dataset from above fits your question? Are there any other filter combinations you could try out?


Once you are satisfied with the items represented in your dataset, we can move on to the next step: MT quality evaluation.


In [5]:
# TODO: Plot the post-edit density data using different column names as categorical values.
# This works best with fewer than 10 categories per column.
# Try plotting for source and target languages, relation or project IDs.
@interact
def plot_widget(cat_column=["t_lid", "Document", "Relation", "Project"], save=[False, "PED_by_<cat_column>.png"], kde=True, linewidth=(0,10)):  
    if save:
        save = save.replace("<cat_column>", "{}").format(cat_column)        
    plot(data, cat_column=cat_column, 
         kde=kde, 
         ped=ped, 
         save=save,
         linewidth=linewidth,
         fontsize=20
         )

interactive(children=(Dropdown(description='cat_column', options=('t_lid', 'Document', 'Relation', 'Project'),…

## 3. Quality evaluation
Quality evaluation (QE) is a two-step process to uncover error patterns in translated output. This is commonly done in a spreadsheet, but some CAT environments such as Trados Studio offer [integrations as well](https://www.youtube.com/watch?v=a2wid7Uxy54 "How to use Translation Quality Assessment in SDL Trados Studio 2019").

In a first step, we evaluate MT output against the source and a reference translation created by a linguist. In order to quantize a potentially infinite number of errors, we map our findings against a fixed-size **error typology**. In addition to that, we categorize errors depending on severity and project requirements. An important reference point in translation services is the [Multidimensional Quality Metrics (MQM)](http://www.qt21.eu/mqm-definition/definition-2015-12-30.html "Multidimensional Quality Metrics (MQM) Definition") framework, together with MQM subsets such as the Dynamic Quality Framework (DQF).

- - -

Note: QEs are also used to evaluate the quality of a particular delivery. In this case, errors are counted, weighted and summed to produce an **error score**. This score is then divided by the word count to produce the **error rate**, which determines whether the delivery has __passed or failed__ the quality evaluation. While this type of QE often includes a summary that highlights features of the text and appraises positive and negative aspects, analysing errors for structural patterns is beyond its scope.
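
The error score and error rate described in this note reduce to a short calculation. The severity weights and the pass threshold below are hypothetical placeholders; actual values depend on the quality framework and the project requirements.

```python
# Hypothetical severity weights and pass threshold (weighted errors per word).
WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
THRESHOLD = 0.02


def error_rate(error_counts, word_count):
    """Return (error score, error rate) for counted errors per severity."""
    score = sum(WEIGHTS[severity] * count for severity, count in error_counts.items())
    return score, score / word_count


score, rate = error_rate({"minor": 4, "major": 1, "critical": 0}, word_count=500)
verdict = "passed" if rate <= THRESHOLD else "failed"
print(score, rate, verdict)
```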

- - -

In a second step, we look at the annotations from a bird's-eye view and try to determine common patterns and sources for errors. **Visualisations** can prove useful here, including *word clouds* and *bar charts* for types of error or *pie charts* for severity. If deemed practical, this can be followed up by another round of annotations and the cycle repeats.
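
A first bird's-eye view does not need a plotting library. Assuming annotations are available as (error type, severity) pairs, which is a made-up structure with illustrative category names, `collections.Counter` is enough to rank error types and print a rough text bar chart:

```python
from collections import Counter

# Made-up annotations: (error type, severity)
annotations = [
    ("terminology", "major"), ("terminology", "minor"), ("terminology", "minor"),
    ("accuracy", "major"), ("fluency", "minor"), ("fluency", "minor"),
]

by_type = Counter(error_type for error_type, _ in annotations)
by_severity = Counter(severity for _, severity in annotations)

# Most frequent error types first, one '#' per annotation.
for error_type, count in by_type.most_common():
    print("{:<12} {}".format(error_type, "#" * count))
print(dict(by_severity))
```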

For our purpose, quality evaluations result in two types of deliverables:
1. A report on the general quality detailing areas of focus and potential improvement strategies.
2. A list of resources, including *terminology items*, *replacement patterns*, *automated QA checks*, *additional instructions*, etc.

#### QUESTION:

Based on what has been said about the purpose of QE, how would you adapt the tools available to us to achieve these aims?

In [None]:
# Uncomment the next line to save your data to your working directory.
#save_to_excel(data, "out\\qe_data.xlsx")

## 4. Post-processing replacement patterns
Moving from analysis to action. There are three basic approaches to improving MT output:
* Engine re-training: This requires __access to the model parameters__.
* Pre-processing: This involves replacing or annotating tokens in the input __before applying an engine__.
* Post-processing: This involves __replacing tokens in the output__ based on search expressions or rules.

Post-processing can be combined with pre-processing or used on its own. In the following section we will focus on post-processing.

### Creating rules
First, we will create a number of rules to find and replace incorrect terms. In the [entries](source/entries.py) module, there are three different objects we can choose from:
- SearchMTEntry -- Search and replace on the MT output only
- SearchSourceEntry -- Filter segments based on a source expression. Then perform search and replace on the MT output.
- ToggleCaseEntry -- Filter segments based on expected case and length of source. Then apply the case to the target.

An Entry object is characterized by its `search` and `replace` attributes. In addition to this, `SearchSourceEntry` and `ToggleCaseEntry` objects also feature a `source` attribute, which acts like a pre-filter for source strings.
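
The interplay of the three attributes can be sketched with the `re` module. The function below illustrates the idea and is not the implementation in the `entries` module: it leaves the MT string untouched unless the optional source pattern matches, and only then applies the search-and-replace.

```python
import re


def apply_rule(source, mt, search, replace, source_pattern=None):
    """Replace `search` in the MT output, optionally gated by a source pattern."""
    if source_pattern is not None and not re.search(source_pattern, source):
        return mt  # pre-filter failed: this segment is out of scope
    return re.sub(search, replace, mt)


# Source mentions "lineage", so the MT term is replaced; "\1" keeps the plural "s".
print(apply_rule("Two lineages were found.", "Se encontraron dos encastes.",
                 r"\bencaste(s)?\b", r"linaje\1", r"\blineage"))
# Source does not mention "lineage", so the MT output is returned unchanged.
print(apply_rule("Two breeds were found.", "Se encontraron dos encastes.",
                 r"\bencaste(s)?\b", r"linaje\1", r"\blineage"))
```

Gating on the source keeps a rule from firing in segments where the MT term is actually a correct translation of something else.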

In [6]:
a = SearchSourceEntry(1, "ASE", None, "Term: Methodology: methodología", "EN", "ES", "Materiales y métodos:", "Metodología", "^Methodology:")
b = SearchSourceEntry(1, "ASE", None, "Term: rupture: ruptura", "EN", "ES", "\\bperforación(?:e)?(s)?", "ruptura\\1", "\\b[Rr]upture")
c = SearchSourceEntry(1, "ASE", None, "Term: patients: pacientes", "EN", "ES", "\\bpersonas\\b", "pacientes", "\\bpatients\\b")
d = SearchSourceEntry(1, "ASE", None, "Term: defects: defectos", "EN", "ES", "\\bproblem(?:a)?(s)?\\b", "defecto\\1", "\\bdefect")
e = SearchSourceEntry(1, "ASE", None, "Term: Breeding season: temporada de reproducción", "EN", "ES", "\\bestación reproductiva\\b", "temporada de reproducción", "\\bbreeding season")
f = SearchSourceEntry(1, "ASE", None, "Term: lineage: linaje", "EN", "ES", "\\bencaste(s)?\\b", "linaje\\1", "\\blineage")
g = SearchMTEntry(1, "ASE", None, "Term: heterocigosis -> heterocigosidad", "EN", "ES", "\\bheterocigosis\\b", "heterocigosidad")
h = ToggleCaseEntry(999, "ASE", None, "Capitalize first character if this is reflected in the source", "All", "ES", 1, 'upper', None)

# Test a rule's search expression against the MT column.
# If you receive a UserWarning, please ignore it.
filter_items(exp=b.search, data=data, col="mt")

  my_filter = data[col].str.contains(p, regex=True)


Unnamed: 0,Project,Relation,Document,s_lid,t_lid,score,source,target,mt,virtual,max_char,lev
5296,test-gma-en2es-health,WMT_16_ped,S0026.csv.sdlxliff,EN,ES,0.172222,"Conclusions: the factors associated with appendiceal perforation were delay in medical attention, previous medication, and the type of insurance owned by the patient.","Conclusiones: Los factores asociados a la perforación apaendicular son el retraso en la atención médica, la medicación previa y el tipo de seguro del paciente.","Conclusiones: los factores asociados a la perforación del apéndice fueron el retraso en la atención médica, la medicación previa y el tipo de seguro del que es titular el paciente.",0.172222,180,31.0


### Compiling substitution sets of rules
After creating some rules relevant to the error patterns found in the data, we are going to apply them and measure their impact on the PED. This includes: 
1. Creating a `PreprocSub` wrapper object to simplify applying and managing our set of entries.
2. Applying the set of entries to our data.
3. Recalculating the PED score for each segment.

Note that step three will provide us with updated scores for each entry. This is useful to determine the statistical impact of a rule. If a rule has little, no or a negative impact, we probably don't want to include it in our set of entries. Every time we apply an entry object to a table, its `ped_effect` attribute will be updated accordingly. 
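
One way to act on these numbers is to rank rules by their effect and keep only those that help. The sketch below assumes the effect is the PED before applying a rule minus the PED after it, so positive values mean less post-editing; the rule names and figures are invented for illustration, and `RuleResult` is a simplified stand-in for the entry objects.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    desc: str
    ped_before: float
    ped_after: float

    @property
    def ped_effect(self):
        """Positive when the rule brings the MT output closer to the reference."""
        return self.ped_before - self.ped_after


results = [
    RuleResult("Term C", 0.3840, 0.3842),   # moves output away from reference
    RuleResult("Term B", 0.3840, 0.3840),   # never matches, no effect
    RuleResult("Term A", 0.3840, 0.3835),   # helps
]

# Best rules first; drop anything without a measurable benefit.
ranked = sorted(results, key=lambda r: r.ped_effect, reverse=True)
kept = [r.desc for r in ranked if r.ped_effect > 0]
print(kept)
```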

In [7]:
subs_list = [a, b, c, d, e, f, g, h]

subs = PreprocSub(created_by="ASE", desc="For WMT 16 testset (EN-ES)", entries=subs_list)
data = subs.apply_to_table(data, verbose=True)
print(subs.ped_effect)

Original PED:	0.384075
Updated PED:	0.384075	Term: Methodology: methodología


  return df["virtual"].mask(df.mt.str.contains(p) == True)


Updated PED:	0.384075	Term: rupture: ruptura
Updated PED:	0.384092	Term: patients: pacientes
Updated PED:	0.384092	Term: defects: defectos
Updated PED:	0.384092	Term: Breeding season: temporada de reproducción
Updated PED:	0.384092	Term: lineage: linaje
Updated PED:	0.384092	Term: heterocigosis -> heterocigosidad
Updated PED:	0.384092	Capitalize first character if this is reflected in the source
-1.6513632979386905e-05


In [8]:
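# Re-run the filter: segments changed by rule b no longer match its search expression.
# Segments whose source does not match b.source are left as they were.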
filter_items(exp=b.search, data=data, col="mt")

Unnamed: 0,Project,Relation,Document,s_lid,t_lid,score,source,target,mt,virtual,max_char,lev
5296,test-gma-en2es-health,WMT_16_ped,S0026.csv.sdlxliff,EN,ES,0.172222,"Conclusions: the factors associated with appendiceal perforation were delay in medical attention, previous medication, and the type of insurance owned by the patient.","Conclusiones: Los factores asociados a la perforación apaendicular son el retraso en la atención médica, la medicación previa y el tipo de seguro del paciente.","Conclusiones: los factores asociados a la perforación del apéndice fueron el retraso en la atención médica, la medicación previa y el tipo de seguro del que es titular el paciente.",0.172222,180.0,31.0


In [9]:
# Delete last entry == lowest or most negative PED effect
deleted = subs.entries.pop(-1)
deleted.desc

'Term: patients: pacientes'

Considering how marginal the PED gains are, what approach would you take to identify suitable candidate rules? Depending on the size and homogeneity of your dataset, are any of the tools presented in [section 3](#3.-Quality-evaluation) more or less useful than others?

### Serializing / deserializing substitution lists for storage
In order to store and reuse our set of entries, we use two convenience methods called `convert_to_json` and `load_from_json` that take care of serializing our custom objects. We use the JSON format, because it allows us to review and edit our set in a standard text editor.

To preview what your export will look like, run the following cell:
```
import json
print(json.dumps(subs.convert_to_json(), indent=4, ensure_ascii=False))
```
Note that entries have been re-indexed according to their PED effect on our dataset.
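
The round trip itself is straightforward with the standard library. The sketch below uses a simplified, hypothetical `Rule` dataclass rather than the project's entry classes, and shows why `ensure_ascii=False` matters when reviewing Spanish terms in a text editor.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Rule:
    desc: str
    search: str
    replace: str


rules = [Rule("Term: lineage: linaje", r"\bencaste(s)?\b", r"linaje\1"),
         Rule("Term: rupture: ruptura", r"\bperforación\b", "ruptura")]

# Serialize: dataclasses become plain dicts; non-ASCII characters stay readable.
text = json.dumps([asdict(rule) for rule in rules], indent=4, ensure_ascii=False)
print(text)

# Deserialize: rebuild the objects from the parsed dicts.
restored = [Rule(**item) for item in json.loads(text)]
```

Backslashes appear doubled in the JSON text, so keep them doubled when editing patterns by hand.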

In [None]:
#import json
#print(json.dumps(subs.convert_to_json(), indent=4, ensure_ascii=False))

In [10]:
# Serialize list to disk
fp = "out\\wmt16_en-es.json"
subs.convert_to_json(fp);
# To deserialize simply pass in the file path:
new_subs = PreprocSub(fp=fp)
new_subs.entries

[<source.entries.SearchSourceEntry at 0x22bc3ca8e10>,
 <source.entries.SearchSourceEntry at 0x22bc3ca8dd8>,
 <source.entries.SearchSourceEntry at 0x22bc3ca86d8>,
 <source.entries.SearchSourceEntry at 0x22bc3ca80f0>,
 <source.entries.SearchSourceEntry at 0x22bc3ca89e8>,
 <source.entries.SearchMTEntry at 0x22bc40e5470>,
 <source.entries.ToggleCaseEntry at 0x22bc40e52b0>]

## 5. Conclusion
Let's summarize what we have done so far. We have seen how we can aggregate, visualize and export PED data to make better sense of our post-editing efforts. We have discussed how quality evaluation relates to our efforts to improve MT output incrementally. We then presented a small set of tools to create and test post-processing steps. This concludes this notebook. 

If you are interested in how to apply our tools to translation files used in a CAT environment, there is a supplementary PED writer notebook [here](ped_writer_nb.jpynb "PED Writer Notebook"). Hope to see you there!

If you found any of this content helpful or confusing, please let me know. [mailto](mailto:arnseelig[at]gmail.com)

## Sources:
This notebook uses data from the [ACL 2016 Conference on Machine Translation](https://www.statmt.org/wmt16/biomedical-translation-task.html "Biomedical Translation Task"). Please check the [full report](http://www.aclweb.org/anthology/W/W16/W16-2301 "@InProceedings{bojar-EtAl:2016:WMT1,
  author    = {Bojar, Ond\v{r}ej  and  Chatterjee, Rajen and Federmann, Christian  and  Graham, Yvette  and  Haddow, Barry  and  Huck, Matthias  and  Jimeno Yepes, Antonio  and  Koehn, Philipp  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Neveol, Aurelie  and  Neves, Mariana  and  Popel, Martin  and  Post,  Matt  and  Rubino, Raphael  and  Scarton, Carolina  and  Specia,  Lucia  and  Turchi, Marco  and  Verspoor, Karin  and  Zampieri,  Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}") to find out more about the data used for the Biomedical Translation Task and other tasks.