The input for semantic role labelling (SRL) is the SONAR-1 dataset.

We use the CoNLL file prepared by Jisk (see https://github.com/Filter-Bubble/stroll).

(TODO: also describe how to get from SONAR-1 to this CoNLL file.)

In [None]:
import os

In [24]:
root_dir = '/home/dafne/shared/FilterBubble/SRL/pipeline/'

## Step 0: split CONLLU into train/dev/test

The training file produced by this step can be used to train the stroll tagger.

In [10]:
import conllu
import numpy as np

In [6]:
conllu_path = os.path.join(root_dir,'gold-conllu')
fname_in = os.path.join(conllu_path, 'sonar1_fixed.conll')

In [3]:
features = [
    'ID',
    'FORM',
    'LEMMA',
    'UPOS',
    'XPOS',
    'FEATS',
    'HEAD',
    'DEPREL',
    'DEPS',
    'MISC',
    'FRAME',
    'ROLE'
]
features = [f.lower() for f in features]

In [7]:
with open(fname_in) as fin:
    sentences = conllu.parse(fin.read(), fields=features)

In [11]:
nr_sentences = len(sentences)
perm = np.random.permutation(nr_sentences)
train_size = int(nr_sentences*0.8)
test_size = int(nr_sentences*0.1)
dev_size = nr_sentences - train_size - test_size
print('Train size: {}, dev size: {}, test size: {}'.format(train_size, dev_size, test_size))

Train size: 23249, dev size: 2907, test size: 2906


In [18]:
def write_conll(sentences, fn):
    with open(fn, 'w') as fout:
        for sent in sentences:
            fout.write(sent.serialize())
            fout.write('\n')

In [12]:
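# Index the list directly: np.array(sentences) can raise an error in recent
# NumPy versions when the sentences differ in length.
train_set = [sentences[i] for i in perm[:train_size]]
dev_set = [sentences[i] for i in perm[train_size:train_size+dev_size]]
test_set = [sentences[i] for i in perm[train_size+dev_size:]]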
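
The split above relies on `np.random.permutation` without a fixed seed, so rerunning the notebook produces a different partition. The following is a minimal sketch of a reproducible variant, not the code that produced the files used here. It uses the standard-library `random` module to stay self-contained; the function name `split_indices` and the seed value are assumptions made for illustration. The same idea works with `np.random.default_rng(seed).permutation(n)`.

```python
import random


def split_indices(n, train_frac=0.8, test_frac=0.1, seed=42):
    """Return (train, dev, test) index lists for n items, shuffled reproducibly."""
    indices = list(range(n))
    # A fixed seed gives the same shuffled order on every run.
    random.Random(seed).shuffle(indices)
    train_size = int(n * train_frac)
    test_size = int(n * test_frac)
    # As in the cell above, dev takes whatever is left over.
    dev_size = n - train_size - test_size
    train = indices[:train_size]
    dev = indices[train_size:train_size + dev_size]
    test = indices[train_size + dev_size:]
    return train, dev, test


train_idx, dev_idx, test_idx = split_indices(29062)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With 29062 sentences this yields the same split sizes as reported above (23249, 2907 and 2906), and the index lists can be used as `[sentences[i] for i in train_idx]`.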

In [19]:
#write_conll(train_set, os.path.join(conllu_path, 'sonar1_train.conll'))
#write_conll(dev_set, os.path.join(conllu_path, 'sonar1_dev.conll'))
#write_conll(test_set, os.path.join(conllu_path, 'sonar1_test.conll'))

## Step 1: from CONLLU with golden annotation to NAF

Use the [conll2naf](https://github.com/Filter-Bubble/FormatConversions/tree/master/conll2naf) conversion script:
```
python conll2naf --file_per_sent -o ~/output/path/train ~/input/path/sonar1_train.conll
python conll2naf --file_per_sent -o ~/output/path/dev ~/input/path/sonar1_dev.conll
python conll2naf --file_per_sent -o ~/output/path/test ~/input/path/sonar1_test.conll
```

By default, the script creates a single NAF file for the complete CoNLL file, gluing all sentences together. The `--file_per_sent` option instead writes one NAF file per sentence in the CoNLL file.

In theory, we could use these NAF files to train and test the vua-srl tagger, but `python nafAlpinoToSRLFeatures.py` gives an error. The cause is not yet clear: possibly the NAF files are missing constituents or use the wrong tags.

## Step 2: extract raw text

In [20]:
from KafNafParserPy import KafNafParser

In [21]:
raw_path = os.path.join(root_dir, 'raw')
# Create raw/train, raw/dev and raw/test if they do not exist yet.
for s in ['train', 'dev', 'test']:
    os.makedirs(os.path.join(raw_path, s), exist_ok=True)

In [22]:
naf_path = os.path.join(root_dir, 'gold-naf')

In [23]:
for s in ['train', 'dev', 'test']:
    for fname in os.listdir(os.path.join(naf_path, s)):
        fpath = os.path.join(naf_path, s, fname)
        naf_obj = KafNafParser(fpath)
        fname_out = os.path.splitext(fname)[0] + '.txt'
        with open(os.path.join(raw_path, s, fname_out), 'w') as fout:
            fout.write(naf_obj.get_raw())

## Step 3: Run pipeline with StanfordNLP

For this you need [this fork](https://github.com/Filter-Bubble/vu-rm-pip3) of the pipeline, which is still under development.

In [None]:
stanfordnlp_path = os.path.join(root_dir, 'stanfordnlp-naf')
# Create one output directory per split if it does not exist yet.
for s in ['train', 'dev', 'test']:
    os.makedirs(os.path.join(stanfordnlp_path, s), exist_ok=True)

```bash
# Loop over the files with a glob rather than parsing the output of ls.
for f in ~/shared/FilterBubble/SRL/pipeline/raw/dev/*; do
    fn=$(basename "$f")
    ./scripts/run-pipeline.sh -c cfg/pipeline_stanfordnlp.yml \
        < "$f" \
        > ~/shared/FilterBubble/SRL/pipeline/stanfordnlp-naf/dev/"$fn".naf
done
```

Repeat this for the train and test splits.


We can now compare the output with the files in `gold-naf` to see how well StanfordNLP performed, since both use the same tags. (Do we have an evaluation script for the dependency parser?)
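
Such a comparison is usually reported as unlabelled and labelled attachment scores (UAS and LAS). The following is a minimal sketch of that computation, not an existing script from the pipeline. It assumes that the gold and predicted tokens are already aligned one-to-one, which is not guaranteed here because StanfordNLP tokenizes the raw text itself. The function name `attachment_scores` and the toy data are invented for illustration.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(predicted):
        raise ValueError('gold and predicted must contain the same number of tokens')
    if not gold:
        return 0.0, 0.0
    head_ok = 0  # tokens with the correct head (counts towards UAS)
    both_ok = 0  # tokens with the correct head and label (counts towards LAS)
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, predicted):
        if g_head == p_head:
            head_ok += 1
            if g_rel == p_rel:
                both_ok += 1
    return head_ok / len(gold), both_ok / len(gold)


# Toy sentence of four tokens; the values are invented for illustration.
gold = [(2, 'nsubj'), (0, 'root'), (4, 'det'), (2, 'obj')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'det'), (2, 'obl')]
uas, las = attachment_scores(gold, pred)
print('UAS: {:.2f}, LAS: {:.2f}'.format(uas, las))
```

In the toy example three of the four heads match and two of those also carry the correct label, so the scores are 0.75 and 0.50.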

## Step 4: Run pipeline with Alpino

In [None]:
alpino_path = os.path.join(root_dir, 'alpino-naf')
# Create one output directory per split if it does not exist yet.
for s in ['train', 'dev', 'test']:
    os.makedirs(os.path.join(alpino_path, s), exist_ok=True)

```bash
# Write to the dev subdirectory, matching the directories created above.
for f in ~/shared/FilterBubble/SRL/pipeline/raw/dev/*; do
    fn=$(basename "$f")
    ./scripts/run-pipeline.sh -c cfg/pipeline_alpino.yml \
        < "$f" \
        > ~/shared/FilterBubble/SRL/pipeline/alpino-naf/dev/"$fn".naf
done
```
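
Because the shell loops redirect with `>`, an output file is created even when the pipeline fails on a sentence, so a failed run can leave an empty `.naf` file behind. The following is a minimal sketch of a check for missing or empty outputs. The helper name `missing_outputs` is hypothetical, and it assumes the naming used in the loops, where each output is the raw file name plus `.naf`.

```python
import os


def missing_outputs(raw_dir, naf_dir, suffix='.naf'):
    """Return raw file names that have no non-empty '<name><suffix>' file in naf_dir."""
    missing = []
    for fname in sorted(os.listdir(raw_dir)):
        out_path = os.path.join(naf_dir, fname + suffix)
        # A missing file and an empty file both count as a failed run.
        if not os.path.isfile(out_path) or os.path.getsize(out_path) == 0:
            missing.append(fname)
    return missing


# Example call, assuming the directory layout used in this notebook:
# missing_outputs(os.path.join(root_dir, 'raw', 'dev'),
#                 os.path.join(root_dir, 'alpino-naf', 'dev'))
```

The returned names can be fed back into the pipeline loop to rerun only the sentences that failed.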

## Step 5: convert files to conll

We convert the NAF files to CoNLL so that we can use them to evaluate the stroll parser.

In [None]:
stanfordnlp_conll_path = os.path.join(root_dir, 'stanfordnlp-conll')
# Create one output directory per split if it does not exist yet.
for s in ['train', 'dev', 'test']:
    os.makedirs(os.path.join(stanfordnlp_conll_path, s), exist_ok=True)

TODO: extend the naf2conll converter so that it outputs all fields needed by the SRL tagger.
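
As a reference for the target layout, the following is a minimal sketch of writing one token as a 12-column line in the field order used in Step 0. It is not part of the naf2conll converter. The helper name `format_token` is hypothetical, the example token is invented, and the sketch assumes every value is already a plain string or number (in particular, `feats` is not a dict).

```python
FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos', 'feats',
          'head', 'deprel', 'deps', 'misc', 'frame', 'role']


def format_token(token):
    """Format a token dict as one tab-separated line, using '_' for empty fields."""
    values = []
    for field in FIELDS:
        value = token.get(field)
        values.append('_' if value is None or value == '' else str(value))
    return '\t'.join(values)


# Invented example token; the role value is a placeholder.
token = {'id': 1, 'form': 'Hij', 'lemma': 'hij', 'upos': 'PRON',
         'head': 2, 'deprel': 'nsubj', 'role': 'Arg0'}
print(format_token(token))
```

Fields that the converter cannot fill yet are written as `_`, so the column count stays constant.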