In [13]:
import os
import sys

# make the modules in ./src importable from this notebook
sys.path.append(os.path.join(os.getcwd(), 'src'))

In [14]:
from morph_dict_tools.universal import *
from ud_tools import *
from syntactic_patterns import *
from generate_data_multiprocess import parallel_replacement

# Building SPUD treebanks

In [15]:
# Languages to build SPUD treebanks for (use e.g. ["en"] for a quick single-language run)
langs = ["ar", "de", "en", "fr", "ru"]

In [16]:
# UPOS tags to include in the perturbation
upos_filter = ["NOUN", "PROPN", "VERB", "ADJ", "ADV"]

## Prerequisites

After running the preparation script, the UD treebanks should be downloaded and preprocessed, and the morphological dictionaries should be available as pickles.

First, let's load the morphological dictionaries:

In [17]:
print('load morphdicts')
morphdicts = {lang: load_morphdict_from_pickle(lang) for lang in langs}

load morphdicts


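Next, we specify where the treebank files are located. Arabic, German, English, and French use the preprocessed files in `data/ud-mod/`, while Russian uses the original UD v2.10 files.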

In [18]:
tb_path_mod = "data/ud-mod/"
tb_path_orig = "data/ud-treebanks-v2.10/"
tb_paths = {
    "ar": {
        "train": tb_path_mod + "UD_Arabic-PADT/ar_padt-ud-train.conllu",
        "dev": tb_path_mod + "UD_Arabic-PADT/ar_padt-ud-dev.conllu",
        "test": tb_path_mod + "UD_Arabic-PADT/ar_padt-ud-test.conllu"},
    "de": {
        "train": tb_path_mod + "UD_German-HDT/de_hdt-ud-train.conllu", 
        "dev": tb_path_mod + "UD_German-HDT/de_hdt-ud-dev.conllu", 
        "test": tb_path_mod + "UD_German-HDT/de_hdt-ud-test.conllu"},
    "en": {
        "train": tb_path_mod + "UD_English-EWT/en_ewt-ud-train.conllu",
        "dev":   tb_path_mod + "UD_English-EWT/en_ewt-ud-dev.conllu",
        "test":  tb_path_mod + "UD_English-EWT/en_ewt-ud-test.conllu"},
    "fr": {
        "train": tb_path_mod + "UD_French-GSD/fr_gsd-ud-train.conllu",
        "dev": tb_path_mod + "UD_French-GSD/fr_gsd-ud-dev.conllu",
        "test": tb_path_mod + "UD_French-GSD/fr_gsd-ud-test.conllu"},
    "ru": {
        "train": tb_path_orig + "UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu",
        "dev": tb_path_orig + "UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu",
        "test": tb_path_orig + "UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu"},
}
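
Before loading, it can help to confirm that every file listed above actually exists, since a missing preparation step otherwise only surfaces while the treebanks are being parsed. The following is a minimal sketch; `find_missing_files` is a hypothetical helper that is not part of the repository, and the example dictionary is a stand-in for `tb_paths`.

```python
import os

def find_missing_files(paths_by_lang):
    """Return (lang, split, path) triples for every path that is not an existing file."""
    missing = []
    for lang, splits in paths_by_lang.items():
        for split, path in splits.items():
            if not os.path.isfile(path):
                missing.append((lang, split, path))
    return missing

# Stand-in dictionary with a made-up language code; in the notebook, pass tb_paths instead.
example_paths = {"xx": {"dev": "data/does-not-exist/xx-ud-dev.conllu"}}
for lang, split, path in find_missing_files(example_paths):
    print(f"missing {lang}/{split}: {path}")
```

An empty result means all listed files were found.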

We load only the dev and test splits:

In [19]:
treebanks = dict() 
for lang in langs:
    print('load treebanks for ', lang)
    treebanks[lang] = {
        "dev": load_ud_treebank(tb_paths[lang]["dev"]),
        "test": load_ud_treebank(tb_paths[lang]["test"])
    }

load treebanks for  ar
read file  data/ud-mod/UD_Arabic-PADT/ar_padt-ud-dev.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
read file  data/ud-mod/UD_Arabic-PADT/ar_padt-ud-test.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
load treebanks for  de
read file  data/ud-mod/UD_German-HDT/de_hdt-ud-dev.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
read file  data/ud-mod/UD_German-HDT/de_hdt-ud-test.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
load treebanks for  en
read file  data/ud-mod/UD_English-EWT/en_ewt-ud-dev.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
read file  data/ud-mod/UD_English-EWT/en_ewt-ud-test.conllu
parse data into token lists
apply cutoff of  None
convert token lists to trees
Done parsing
load treebanks for  fr

In [20]:
# For each language, build the syntactic patterns, which hold the syntactic contexts used for replacements
syntactic_patterns = dict()
for lang in langs:
    pattern_config = pattern_configs[lang]
    # index 1 of a loaded treebank holds the trees (index 0 holds the sentences)
    dev_test_trees = treebanks[lang]["dev"][1] + treebanks[lang]["test"][1]
    syntactic_patterns[lang] = SyntacticPatterns(dev_test_trees, upos_filter=upos_filter, pattern_config=pattern_config)

## Stop making sense!

Now we are ready to build the SPUD treebanks.
For this, we need to define the following parameters:
- `num_runs`: the number of nonce versions to build per sentence.
- `cutoff`: the maximum number of sentences to process. Generation can take a while, so a small value is useful for a first test.
- `split`: the treebank split to perturb (`"dev"` or `"test"`).
- `lang2processes` (optional, default is 1 process): the number of parallel CPU processes per language. This depends on the available RAM and on the language, since the morphological dictionaries differ in size. The numbers here are chosen for 64 GB RAM plus swap, so decrease them if you have less RAM.
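
As a rough illustration of how a per-language process count could be chosen safely, the sketch below caps the configured value at the number of available CPU cores. `pick_num_processes` is a hypothetical helper, not part of the repository, and it does not measure RAM, which the text above names as the real constraint; treat it as a lower bound on caution rather than a complete check.

```python
import os

def pick_num_processes(lang, lang2processes, default=1):
    """Use the configured process count for lang, capped at the available CPU cores."""
    requested = lang2processes.get(lang, default)
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))

# Example configuration mirroring the shape of lang2processes in this notebook.
example_config = {"de": 6, "en": 3}
print(pick_num_processes("de", example_config))
print(pick_num_processes("xx", example_config))  # unknown language falls back to the default
```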

The cells below generate the SPUD sets for the split selected in `split` (here the dev split).

In [23]:
# Generate num_runs nonce versions per sentence. This might take a while,
# so cutoff limits the number of sentences that are processed.
num_runs = 3
cutoff = 100
split = "dev"
lang2processes = {
    "ar": 2,
    "de": 6,
    "en": 3,
    "fr": 2,
    "ru": 4,
}

In [24]:
lang2run2newsents = {lang:dict() for lang in langs}

for lang in langs:
    num_processes = 1  # use lang2processes[lang] instead to run in parallel
    print(f'replace tokens in {split}', lang)
    for i in range(num_runs):
        print('run ', i)
        new_sents = parallel_replacement(
            lang=lang,
            sents=treebanks[lang][split][0][:cutoff],
            trees=treebanks[lang][split][1][:cutoff],
            morphdict=morphdicts[lang],
            synt_patterns=syntactic_patterns[lang],
            upos_filter=upos_filter,
            num_processes=num_processes)
        lang2run2newsents[lang][i] = new_sents

replace tokens in dev ar
run  0
num_processes 1
treebank size 100
slice_size 101
build and start processes
i 0 start 0 end 100
0, join processes
join process 0
Created  100  new sentences
Saving sentences to conllu file
Done with slice  0
done creating. Now reloading and merging
0
read file  _ar_sents_slice_0.conllu
parse data into token lists
apply cutoff of  None
remove tmp files
run  1
num_processes 1
treebank size 100
slice_size 101
build and start processes
i 0 start 0 end 100
0, join processes
join process 0
Created  100  new sentences
Saving sentences to conllu file
Done with slice  0
done creating. Now reloading and merging
0
read file  _ar_sents_slice_0.conllu
parse data into token lists
apply cutoff of  None
remove tmp files
run  2
num_processes 1
treebank size 100
slice_size 101
build and start processes
i 0 start 0 end 100
0, join processes
join process 0
Created  100  new sentences
Saving sentences to conllu file
Done with slice  0
done creating. Now reloading and merging
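
The log above reports `slice_size 101` for 100 sentences and one process, with a single slice covering indices 0 to 100. That is consistent with a slice size of `n // num_processes + 1`. The sketch below reproduces this arithmetic; it is an inference from the log output, not the actual code of `parallel_replacement`, and `slice_bounds` is a hypothetical name.

```python
def slice_bounds(n_items, num_processes):
    """Split n_items into contiguous (start, end) slices, one per process."""
    slice_size = n_items // num_processes + 1  # assumed formula, matches the log above
    bounds = []
    for i in range(num_processes):
        start = i * slice_size
        end = min(start + slice_size, n_items)
        if start < end:  # skip empty trailing slices
            bounds.append((start, end))
    return slice_size, bounds

print(slice_bounds(100, 1))  # (101, [(0, 100)])
print(slice_bounds(100, 3))  # (34, [(0, 34), (34, 68), (68, 100)])
```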


In [26]:
out_dir_prefix = "data/spud/"
for lang in langs:
    for r in range(num_runs):
        # show progress on a single console line, overwriting the previous message
        print(f"write {lang} run {r}", end="\r", flush=True)
        out_dir = f"{out_dir_prefix}{lang}/{r}/"
        os.makedirs(out_dir, exist_ok=True)
        new_sents = lang2run2newsents[lang][r]
        # name the file after the split that was perturbed
        serialize_sents_to_conllu_file(new_sents, f"{out_dir}spud_{split}.conllu")

write ru run 2
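
To sanity-check a written file, you can count its sentences and compare the result with the number you expected to generate. The sketch below relies only on the CoNLL-U convention that sentences are separated by blank lines and that comment lines start with `#`. `count_conllu_sentences` is a hypothetical helper, not part of the repository, and the sample lines are made up for illustration.

```python
def count_conllu_sentences(lines):
    """Count sentences in CoNLL-U lines (an open file or any iterable of strings)."""
    count = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        if not line:
            in_sentence = False  # a blank line ends the current sentence
        elif not line.startswith("#") and not in_sentence:
            count += 1  # first token line of a new sentence
            in_sentence = True
    return count

sample = [
    "# text = Hello world\n",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n",
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n",
    "\n",
    "# text = Bye\n",
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n",
    "\n",
]
print(count_conllu_sentences(sample))  # 2
```

For a file on disk, pass the open file object, for example `with open(path, encoding="utf-8") as f: count_conllu_sentences(f)`.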

## Extending to a new language

To add a new language, you need to be able to run all of the cells above for it. This requires the following steps:

- Implement a morphological dictionary for the language.
    - Study the class for an existing language (`src/morph_dict_tools/udlex_french.py` is a good example) and adapt the method that prepares the dictionary from the UDLexicon file of your language.
    - Then create a pickle of the morphdict by extending `prep/pickle_morphdicts.py` with the class of your new dictionary.
- Add the treebank files to `tb_paths` as shown earlier in this notebook, and add the two-letter language ID to the list `langs`.
- Add a syntactic pattern for the language to the pattern config in `src/syntactic_patterns.py` (documentation is provided there).
- And that's it!
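
For the treebank path step, the entries in `tb_paths` all follow the same naming scheme, so a new language can be added with a small helper. This is a minimal sketch: `ud_split_paths` is a hypothetical helper, and Spanish with `UD_Spanish-GSD` is only an illustrative choice. Whether the files should come from `data/ud-mod/` or from the original UD directory depends on how the preparation script handles your language.

```python
def ud_split_paths(base_dir, treebank_dir, file_prefix):
    """Build the train/dev/test CoNLL-U paths for one UD treebank."""
    return {
        split: f"{base_dir}{treebank_dir}/{file_prefix}-ud-{split}.conllu"
        for split in ("train", "dev", "test")
    }

# Illustrative values only; adjust them to the treebank you actually use.
new_lang = "es"
new_paths = ud_split_paths("data/ud-treebanks-v2.10/", "UD_Spanish-GSD", "es_gsd")
print(new_paths["dev"])
# data/ud-treebanks-v2.10/UD_Spanish-GSD/es_gsd-ud-dev.conllu
```

In the notebook you would then set `tb_paths[new_lang] = new_paths` and append `new_lang` to `langs`. The morphological dictionary pickle and the pattern config entry from the other steps are still required.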