## Discourse segmentation

(Customizes the multilingual RST segmentation system introduced in https://www.aclweb.org/anthology/W19-2715/)

Creates new train, dev, and test sets for Russian, adds new model configs, and modifies the training/evaluation scripts.

Requires:

 - code for the original paper in ``../tony/``
 
Output:

 - rus.rst.rrt_train.conll
 - rus.rst.rrt_dev.conll
 - rus.rst.rrt_test.conll
 - configs & scripts for segmentation models

<div class="alert alert-block alert-warning">
<b>Note:</b> some files are parsed incorrectly, so double-check the generated data. The most recent corrected *.conll output of this notebook is archived in <b>corpus/corrected_segmentation_data.zip</b>.
</div>
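
One way to double-check the generated data is a small consistency scan over the output files. The sketch below is not part of the original pipeline. It assumes tab-separated token lines whose last column is the segmentation tag (`BeginSeg=Yes` or `_`), blank lines between sentences, and `#`-prefixed comment lines. The name `check_conll` is hypothetical.

```python
from collections import Counter


def check_conll(lines):
    """Collect simple consistency problems in CoNLL-style lines.

    Assumption: token lines are tab-separated and end with the tag column.
    """
    problems = []
    widths = Counter()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # sentence separator or comment line
        columns = line.split('\t')
        widths[len(columns)] += 1
        if columns[-1] not in ('BeginSeg=Yes', '_'):
            problems.append((number, 'unexpected tag: ' + columns[-1]))
    if len(widths) > 1:
        # Line number 0 marks a file-level problem.
        problems.append((0, 'inconsistent column counts: ' + str(dict(widths))))
    return problems


sample = [
    '# newdoc id = demo\n',
    '1\tПривет\t_\tBeginSeg=Yes\n',
    '2\tмир\t_\t_\n',
    '\n',
    '1\tПока\tBeginSeg=Yes\n',  # deliberately has one column fewer
]
print(check_conll(sample))
```

To scan a real file, pass an open file handle, for example `check_conll(open('rus.rst.rrt_train.conll', encoding='utf-8'))`.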

In [None]:
! pip install -U git+https://github.com/IINemo/isanlp.git@discourse

### 1. Prepare dataset for model training and evaluation 

In [None]:
from utils.file_reading import read_edus
from utils.file_reading import SYMBOL_MAP

def prepare_token(token):
    for key, value in SYMBOL_MAP.items():
        token = token.replace(key, value)
        
    return token
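
To show what `prepare_token` does without the project's `utils` package, here is a minimal sketch with a made-up symbol map. `DEMO_SYMBOL_MAP` and `demo_prepare_token` are hypothetical names, and the real `SYMBOL_MAP` may contain different pairs.

```python
# Hypothetical stand-in for utils.file_reading.SYMBOL_MAP (assumption).
DEMO_SYMBOL_MAP = {
    '«': '"',
    '»': '"',
    '…': '...',
}


def demo_prepare_token(token, symbol_map=DEMO_SYMBOL_MAP):
    # Same loop as prepare_token: apply every replacement in turn.
    for key, value in symbol_map.items():
        token = token.replace(key, value)
    return token


print(demo_prepare_token('«Жемчужина Сахи»…'))
```

Applying the same normalization to both the EDU text and the token text is what makes the string comparisons in the next cell meaningful.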

In [None]:
def annot2tags(annot, edus):
    tags = []
    cursor = 0

#     for i, edu in enumerate(edus):
#         if prepare_token(edu).find(prepare_token(annot['text'][annot['tokens'][0].begin:annot['tokens'][0].end])) != -1:
#             cursor = i         
    
    for sentence in range(len(annot['sentences'])):
        sentence_tags = []
        previous_first_token = 0
        previous_edu = ''

        for token in range(annot['sentences'][sentence].begin, annot['sentences'][sentence].end):

            if cursor == len(edus):
                is_first_token = False

            else:
                is_first_token = False
                
                tmp_edu = prepare_token(edus[cursor])
                annot['text'] = annot['text'].replace('билие ', 'Обилие ')
                original_text = annot['text'][annot['tokens'][token].begin:annot['tokens'][token].end]
                original_text = prepare_token(original_text).strip()

                if tmp_edu.startswith(original_text):
                    if previous_edu:
                        if prepare_token(annot['text'][annot['tokens'][previous_first_token].begin:annot['tokens'][
                        token].begin].strip()) == previous_edu or original_text.lower() in ["сначала", "кватахеви", 
                                                                                           "целый", "для", 
                                                                                           "максимальный", "тогда",
                                                                                           "два", "исследованием",
                                                                                            "хотя", "хоть",
                                                                                            "обилие", "активное",
                                                                                            "менее"
                                                                                           ] or tmp_edu in [
                            "и два внешних кольца (α, α).",
                            "У этого кольца самый высокий эксцентриситет из всех,",
                            prepare_token("Средний размер частичек в этом кольце 0,2— 20 метров,"),
                            "(Перевод Нгуен Тхи Тхуи Чам))",
                            "в Чувашии состоялся необъявленный праздник саха культуры.",
                            "и выпустила сборник якутской поэзии «Жемчужина Сахи» (Саха ахахӗ, 1996),",
                            "Однако и песня ему кажется какой-то бесцветной. Видимо, потому, что она творится только голосом, а не душой [Там же];",
                            "и может использоваться, как и все конструкции, содержащие турцизмы,",
                            prepare_token("Новобранщине - солдатчине, Эполетщине - бобёрщине, Всей пехотщине, поморщине, менее Хлеборобщине, военщине, Что с армейской долей венчаны более "),
                            "который фильтрует вредные ультрафиолетовые лучи Солнца.",
                            "а также крем-уход для волос «Пептиды шелка и иранская хна».",
                            "К 2010 году около 100 озоноразрушающих веществ, включая ХФУ, будут сняты с производства повсеместно."
                        ]:
                            is_first_token = True
                            previous_first_token = token
                            previous_edu = tmp_edu
                            cursor += 1
                    else:
                        is_first_token = True
                        previous_first_token = token
                        previous_edu = tmp_edu
                        cursor += 1

            tag = 'BeginSeg=Yes' if is_first_token else '_'
            sentence_tags.append(tag)

        tags.append(sentence_tags)

    return tags
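
The core idea of `annot2tags` can be illustrated on plain strings. A token opens a new segment when the next expected EDU starts with it and the text collected since the previous segment start equals the previous EDU. The sketch below is a simplification, not the function above. It assumes whitespace tokens and EDUs that are exactly the space-joined run of their tokens, and it omits the hand-written exception lists. `tag_segment_starts` is a hypothetical name.

```python
def tag_segment_starts(tokens, edus):
    """Tag each token 'BeginSeg=Yes' if it opens the next EDU, else '_'."""
    tags = []
    cursor = 0    # index of the EDU expected to start next
    current = []  # tokens collected since the last segment start
    for token in tokens:
        # The previous EDU must be fully consumed before a new one may start.
        previous_done = cursor == 0 or ' '.join(current) == edus[cursor - 1]
        if cursor < len(edus) and previous_done and edus[cursor].startswith(token):
            tags.append('BeginSeg=Yes')
            current = [token]
            cursor += 1
        else:
            tags.append('_')
            current.append(token)
    return tags


tokens = 'Он ушёл , потому что устал .'.split()
edus = ['Он ушёл ,', 'потому что устал .']
print(list(zip(tokens, tag_segment_starts(tokens, edus))))
```

The `previous_done` check keeps a token inside an EDU from being mistaken for the start of the next one when the two happen to share a prefix.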

In [None]:
from isanlp.utils.annotation_conll_converter import AnnotationCONLLConverter

converter = AnnotationCONLLConverter()

In [None]:
from isanlp.annotation import Token, Sentence


def split_by_paragraphs(annot_text, annot_tokens, annot_sentences, annot_lemma, annot_morph, annot_postag, annot_ud_postag,
                 annot_syntax_dep_tree):

        def split_on_two(sents, boundary):
            list_sum = lambda l: sum([len(sublist) for sublist in l])

            i = 1
            while list_sum(sents[:i]) < boundary and i < len(sents):
                i += 1

            intersentence_boundary = min(len(sents[i - 1]), boundary - list_sum(sents[:i - 1]))
            return (sents[:i - 1] + [sents[i - 1][:intersentence_boundary]], 
                    [sents[i - 1][intersentence_boundary:]] + sents[i:])
        
        def recount_sentences(chunk):
            sentences = []
            lemma = []
            morph = []
            postag = []
            ud_postag = []
            syntax_dep_tree = []
            tokens_cursor = 0
            local_cursor = 0

            for i, sent in enumerate(chunk['syntax_dep_tree']):
                if len(sent) > 0:
                    sentences.append(Sentence(tokens_cursor, tokens_cursor + len(sent)))
                    lemma.append(chunk['lemma'][i])
                    morph.append(chunk['morph'][i])
                    postag.append(chunk['postag'][i])
                    ud_postag.append(chunk['ud_postag'][i])
                    syntax_dep_tree.append(chunk['syntax_dep_tree'][i])
                    tokens_cursor += len(sent)

            chunk['sentences'] = sentences
            chunk['lemma'] = lemma
            chunk['morph'] = morph
            chunk['postag'] = postag
            chunk['ud_postag'] = ud_postag
            chunk['syntax_dep_tree'] = syntax_dep_tree
            
            return chunk

        chunks = []
        prev_right_boundary = -1

        for i, token in enumerate(annot_tokens[:-1]):

            if '\n' in annot_text[token.end:annot_tokens[i + 1].begin]:
                if prev_right_boundary > -1:
                    chunk = {
                        'text': annot_text[annot_tokens[prev_right_boundary].end:token.end + 1].strip(),
                        'tokens': annot_tokens[prev_right_boundary + 1:i + 1]
                    }
                else:
                    chunk = {
                        'text': annot_text[:token.end + 1].strip(),
                        'tokens': annot_tokens[:i + 1]
                    }

                lemma, annot_lemma = split_on_two(annot_lemma, i - prev_right_boundary)
                morph, annot_morph = split_on_two(annot_morph, i - prev_right_boundary)
                postag, annot_postag = split_on_two(annot_postag, i - prev_right_boundary)
                ud_postag, annot_ud_postag = split_on_two(annot_ud_postag, i - prev_right_boundary)
                syntax_dep_tree, annot_syntax_dep_tree = split_on_two(annot_syntax_dep_tree, i - prev_right_boundary)

                chunk.update({
                    'lemma': lemma,
                    'morph': morph,
                    'postag': postag,
                    'ud_postag': ud_postag,
                    'syntax_dep_tree': syntax_dep_tree,
                })
                chunks.append(recount_sentences(chunk))

                prev_right_boundary = i  # number of last token in the last chunk

        chunk = {
            'text': annot_text[annot_tokens[prev_right_boundary].end:].strip(),
            'tokens': annot_tokens[prev_right_boundary + 1:],
            'lemma' : annot_lemma,
            'morph': annot_morph,
            'postag': annot_postag,
            'ud_postag': annot_ud_postag,
            'syntax_dep_tree': annot_syntax_dep_tree,
        }
        
        chunks.append(recount_sentences(chunk))
        return chunks
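
The boundary rule used by `split_by_paragraphs` (a newline in the gap between two consecutive tokens starts a new chunk) can be tried in isolation. This sketch only groups token indices and leaves out the per-sentence bookkeeping done above. `Span` is a hypothetical stand-in for the token objects, assumed to expose `begin` and `end` offsets.

```python
import re
from collections import namedtuple

# Stand-in for the token objects used above: only character offsets are needed.
Span = namedtuple('Span', ['begin', 'end'])


def paragraph_token_groups(text, tokens):
    """Return lists of token indices, one list per paragraph."""
    groups = [[0]] if tokens else []
    for i in range(1, len(tokens)):
        # A newline in the gap between two tokens marks a paragraph boundary.
        if '\n' in text[tokens[i - 1].end:tokens[i].begin]:
            groups.append([])
        groups[-1].append(i)
    return groups


text = 'Первый абзац.\nВторой абзац.'
# Toy tokenizer (assumption): words and single punctuation marks.
tokens = [Span(m.start(), m.end()) for m in re.finditer(r'\w+|[^\w\s]', text)]
groups = paragraph_token_groups(text, tokens)
print(groups)
```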

In [None]:
from glob import glob
from tqdm.autonotebook import tqdm
from utils.file_reading import read_annotation, read_edus
import re
from utils.train_test_split import split_train_dev_test

train, dev, test = split_train_dev_test('./data')
TRAIN_FILE = 'rus.rst.rrt_train.conll'
DEV_FILE = 'rus.rst.rrt_dev.conll'
TEST_FILE = 'rus.rst.rrt_test.conll'
MAX_LEN = 230


def preprocess(files, train=True, dev=False):
    print(f'preprocess {"dev" if dev else "train" if train else "test"} set')

    output_file = DEV_FILE if dev else TRAIN_FILE if train else TEST_FILE
    with open(output_file, 'w') as fo:
        for filename in tqdm(files):
            filename = filename.replace('.edus', '')
            annot = read_annotation(filename)  # split as well  ToDO:
            edus = read_edus(filename)
            last_edu = 0
            # tags = annot2tags(annot, edus)

            for i, chunk in enumerate(split_by_paragraphs(  # self,
                    annot['text'],
                    annot['tokens'],
                    annot['sentences'],
                    annot['lemma'],
                    annot['morph'],
                    annot['postag'],
                    annot['ud_postag'],
                    annot['syntax_dep_tree'])):

                sentence = 0
                token = 0
                chunk['text'] = annot['text']
                #edus = 
                tags = annot2tags(chunk, edus[last_edu:])
                
                for string in converter(filename.replace('data/', ''), chunk):
                    #print(string)
                    if string.startswith('# newdoc id ='):
                        sentence = 0
                        token = 0
                        fo.write(string + '\n')

                    elif string == '\n':
                        fo.write(string)
                        sentence += 1
                        token = 0

                    else:
                        if ' ' in string:
                            string = re.sub(r' .*\t', '\t', string)
                        if 'www' in string:
                            string = re.sub(r'www[^\t]*', '_html_', string)
                        if 'http' in string:
                            string = re.sub(r'http[^ \t]*', '_html_', string)
                            
                        string = prepare_token(string)                        
                        fo.write(string + '\t' + tags[sentence][token] + '\n')
                        
                        if tags[sentence][token] != '_':
                            last_edu += 1
                        
                        token += 1

                    if token == MAX_LEN:
                        print(filename + ' ::: a very long sentence occurred; truncating to ' + str(MAX_LEN) + ' tokens.')
                        fo.write('\n')
                        sentence += 1
                        token = 0
                        break


preprocess(train)
preprocess(dev, dev=True)
preprocess(test, train=False)
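
The URL masking inside `preprocess` can be checked on single lines. The sketch below reuses the two regular expressions from the cell above, in the same order. The function name `mask_urls` and the three-column sample lines are made up for illustration.

```python
import re


def mask_urls(line):
    """Replace web addresses in a tab-separated line with the '_html_' placeholder."""
    # Same patterns and order as in preprocess(): 'www...' first, then 'http...'.
    line = re.sub(r'www[^\t]*', '_html_', line)
    line = re.sub(r'http[^ \t]*', '_html_', line)
    return line


for sample_line in ('3\thttp://example.com/page\t_',
                    '4\twww.example.com\t_',
                    '5\thttp://www.example.com\t_'):
    print(repr(mask_urls(sample_line)))
```

In the third sample the first pattern leaves `http://_html_`, which the second pattern then collapses to a single placeholder.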

In [None]:
%%bash -s "$TRAIN_FILE" "$DEV_FILE" "$TEST_FILE"

export TONY_PATH="../tony/"

cp ${1} ${TONY_PATH}/data/rus.rst.rrt/${1}
cp ${2} ${TONY_PATH}/data/rus.rst.rrt/${2}
cp ${3} ${TONY_PATH}/data/rus.rst.rrt/${3}

1. Baseline model (BERT-M)

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/bertM.jsonnet


// Configuration for a named entity recognition model based on:
//   Peters, Matthew E. et al. “Deep contextualized word representations.” NAACL-HLT (2018).
{
  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": std.extVar("BERT_VOCAB"),
          "do_lowercase": false,
          "use_starting_offsets": true
      },
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      }
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "simple_tagger",
    "text_field_embedder": {
        "allow_unmatched_keys": true,
        "embedder_to_indexer_map": {
            "bert": ["bert", "bert-offsets"],
            "token_characters": ["token_characters"],
        },
        "token_embedders": {
            "bert": {
                "type": "bert-pretrained",
                "pretrained_model": std.extVar("BERT_WEIGHTS")
            },
            "token_characters": {
                "type": "character_encoding",
                "embedding": {
                    "embedding_dim": 16
                },
                "encoder": {
                    "type": "cnn",
                    "embedding_dim": 16,
                    "num_filters": 128,
                    "ngram_filter_sizes": [3],
                    "conv_layer_activation": "relu"
                }
            }
        }
    },
    "encoder": {
        "type": "lstm",
        "input_size": 768 + 128,
        "hidden_size": 100,
        "num_layers": 1,
        "dropout": 0.5,
        "bidirectional": true
    },
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
    "optimizer": {
        "type": "bert_adam",
        "lr": 0.001
    },
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 3,
    "cuda_device": 1
  }
}


2. CRF model (BERT-M)

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/bertM_crf.jsonnet

// Configuration for a named entity recognition model based on:
//   Peters, Matthew E. et al. “Deep contextualized word representations.” NAACL-HLT (2018).
{
  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": "bert-base-multilingual-cased",
          "do_lowercase": false,
          "use_starting_offsets": true
      },
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      }
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "crf_tagger",
    "dropout": 0.2,
    "calculate_span_f1": true,
    "label_encoding": "BIOUL",
    "text_field_embedder": {
        "allow_unmatched_keys": true,
        "embedder_to_indexer_map": {
            "bert": ["bert", "bert-offsets"],
            "token_characters": ["token_characters"],
        },
        "token_embedders": {
            "bert": {
                "type": "bert-pretrained",
                "pretrained_model": "bert-base-multilingual-cased",
            },
            "token_characters": {
                "type": "character_encoding",
                "embedding": {
                    "embedding_dim": 16
                },
                "encoder": {
                    "type": "cnn",
                    "embedding_dim": 16,
                    "num_filters": 128,
                    "ngram_filter_sizes": [3],
                    "conv_layer_activation": "relu",
                },
                "dropout": 0.2,
            },
        }
    },
    "encoder": {
        "type": "lstm",
        "input_size": 768 + 128,
        "hidden_size": 100,
        "num_layers": 1,
        "dropout": 0.5,
        "bidirectional": true
    },
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
    "optimizer": {
        "type": "bert_adam",
        "lr": 0.001
    },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 3,
    "cuda_device": 1
  }
}


2. CRF model (ELMo)

ELMo embedder: place the ``model.hdf5`` and ``options.json`` files from ``http://vectors.nlpl.eu/repository/20/195.zip`` in the ``models/rsv_elmo/`` folder.

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/elmo.jsonnet

// Configuration for the NER model with ELMo, modified slightly from
// the version included in "Deep Contextualized Word Representations",
// taken from AllenNLP examples
// modified for the disrpt discourse segmentation shared task -- 2019 
{
  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      },
      "elmo": {
        "type": "elmo_characters"
     }
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "crf_tagger",
    "dropout": 0.2,
    "calculate_span_f1": true,
    "label_encoding": "BIOUL",
    "text_field_embedder": {
      "token_embedders": {
        "elmo":{
            "type": "elmo_token_embedder",
            "options_file": "rsv_elmo/options.json",
            "weight_file": "rsv_elmo/model.hdf5",
            "do_layer_norm": false,
            "dropout": 0.0
        },
        "token_characters": {
            "type": "character_encoding",
            "embedding": {
                "embedding_dim": 16
            },
            "encoder": {
                "type": "cnn",
                "embedding_dim": 16,
                "num_filters": 128,
                "ngram_filter_sizes": [3],
                "conv_layer_activation": "relu"
            },
            "dropout": 0.2
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 1024+128,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
      [
        "scalar_parameters",
        {
          "type": "l2",
          "alpha": 0.01,
        }
      ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
    "optimizer": {
        "type": "adam",
        "lr": 0.001
    },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 3,
    "cuda_device": 1
  }
}

3. CRF model (ELMo+fastText)

fastText embedder: place ``http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_word_tokenize/ft_native_300_ru_wiki_lenta_nltk_word_tokenize.vec`` in ``models/``

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/elmo_ft.jsonnet

// Configuration for the NER model with ELMo, modified slightly from
// the version included in "Deep Contextualized Word Representations",
// taken from AllenNLP examples
// modified for the disrpt discourse segmentation shared task -- 2019 
{

  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": false
      },
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      },
      "elmo": {
        "type": "elmo_characters"
     }
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "crf_tagger",
    "dropout": 0.2,
    "calculate_span_f1": true,
    "label_encoding": "BIOUL",
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
            "type": "embedding",
            "embedding_dim": 300,
            "pretrained_file": "ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec",
            "trainable": false
        },
        "elmo":{
            "type": "elmo_token_embedder",
            "options_file": "rsv_elmo/options.json",
            "weight_file": "rsv_elmo/model.hdf5",
            "do_layer_norm": false,
            "dropout": 0.0
        },
        "token_characters": {
            "type": "character_encoding",
            "embedding": {
                "embedding_dim": 16
            },
            "encoder": {
                "type": "cnn",
                "embedding_dim": 16,
                "num_filters": 128,
                "ngram_filter_sizes": [3],
                "conv_layer_activation": "relu"
            },
            "dropout": 0.25
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 1024+128+300,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
      [
        "scalar_parameters",
        {
          "type": "l2",
          "alpha": 0.01,
        }
      ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
    "optimizer": {
        "type": "adam",
        "lr": 0.001
    },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 3,
    "cuda_device": 1
  }
}



4. CRF model (ELMo + RuBERT)

RuBERT embedder: unpack ``http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz`` in ``models/``

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/stacked.jsonnet

// Configuration for the NER model with ELMo, modified slightly from
// the version included in "Deep Contextualized Word Representations",
// taken from AllenNLP examples
// modified for the disrpt discourse segmentation shared task -- 2019 
{

  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      //"tokens": {
      //  "type": "single_id",
      //  "lowercase_tokens": true
      //},
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      },
      "elmo": {
        "type": "elmo_characters"
     },
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": std.extVar("BERT_VOCAB"),
          "do_lowercase": false,
          "use_starting_offsets": true
      },
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "crf_tagger",
    "dropout": 0.2,
    "calculate_span_f1": true,
    "label_encoding": "BIOUL",
    "text_field_embedder": {
        "allow_unmatched_keys": true,
        "embedder_to_indexer_map": {
            "bert": ["bert", "bert-offsets"],
            "token_characters": ["token_characters"],
            "elmo": ["elmo"],
            "tokens": ["tokens"],
        },
      "token_embedders": {
        //"tokens": {
        //    "type": "embedding",
        //    "embedding_dim": 300,
        //    "pretrained_file": "ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec",
        //    "trainable": true
        //},
        "elmo":{
            "type": "elmo_token_embedder",
            "options_file": "rsv_elmo/options.json",
            "weight_file": "rsv_elmo/model.hdf5",
            "do_layer_norm": false,
            "dropout": 0.0
        },
        "bert": {
                "type": "bert-pretrained",
                "pretrained_model": std.extVar("BERT_WEIGHTS"),
                "requires_grad": true,
                "top_layer_only": false
            },
        "token_characters": {
            "type": "character_encoding",
            "embedding": {
                "embedding_dim": 16
            },
            "encoder": {
                "type": "cnn",
                "embedding_dim": 16,
                "num_filters": 128,
                "ngram_filter_sizes": [3],
                "conv_layer_activation": "relu"
            },
            "dropout": 0.2
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 1024+128+768,
      "hidden_size": 200,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
            [
                "scalar_parameters",
                {
                    "alpha": 0.01,
                    "type": "l2"
                }
            ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
        "optimizer": {
            "type": "bert_adam",
            "lr": 0.001
        },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 2,
    "cuda_device": 1
  }
}

5. RuBERT

In [None]:
%%writefile ../tony/code/contextual_embeddings/configs/rubert.jsonnet

// Configuration for the NER model with ELMo, modified slightly from
// the version included in "Deep Contextualized Word Representations",
// taken from AllenNLP examples
// modified for the disrpt discourse segmentation shared task -- 2019 
{

  "dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    "coding_scheme": "BIOUL",
    "token_indexers": {
      //"tokens": {
      //  "type": "single_id",
      //  "lowercase_tokens": true
      //},
      "token_characters": {
        "type": "characters",
        "min_padding_length": 3
      },
#       "elmo": {
#         "type": "elmo_characters"
#      },
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": std.extVar("BERT_VOCAB"),
          "do_lowercase": false,
          "use_starting_offsets": true
      },
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("TEST_A_PATH"),
  "model": {
    "type": "crf_tagger",
    "dropout": 0.2,
    "calculate_span_f1": true,
    "label_encoding": "BIOUL",
    "text_field_embedder": {
        "allow_unmatched_keys": true,
        "embedder_to_indexer_map": {
            "bert": ["bert", "bert-offsets"],
            "token_characters": ["token_characters"],
            "elmo": ["elmo"],
            "tokens": ["tokens"],
        },
      "token_embedders": {
        //"tokens": {
        //    "type": "embedding",
        //    "embedding_dim": 300,
        //    "pretrained_file": "ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec",
        //    "trainable": true
        //},
#         "elmo":{
#             "type": "elmo_token_embedder",
#             "options_file": "rsv_elmo/options.json",
#             "weight_file": "rsv_elmo/model.hdf5",
#             "do_layer_norm": false,
#             "dropout": 0.0
#         },
        "bert": {
                "type": "bert-pretrained",
                "pretrained_model": std.extVar("BERT_WEIGHTS"),
                "requires_grad": true,
                "top_layer_only": false
            },
        "token_characters": {
            "type": "character_encoding",
            "embedding": {
                "embedding_dim": 16
            },
            "encoder": {
                "type": "cnn",
                "embedding_dim": 16,
                "num_filters": 128,
                "ngram_filter_sizes": [3],
                "conv_layer_activation": "relu"
            },
            "dropout": 0.2
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 128+768,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
            [
                "scalar_parameters",
                {
                    "alpha": 0.01,
                    "type": "l2"
                }
            ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 2
  },
  "trainer": {
        "optimizer": {
            "type": "bert_adam",
            "lr": 0.001
        },
    "validation_metric": "+f1-measure-overall",
    "num_serialized_models_to_keep": 3,
    "num_epochs": 10,
    "grad_norm": 5.0,
    "patience": 3,
    "cuda_device": 1
  }
}
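
The encoder `input_size` in each config above is the sum of the widths of the embedders that feed it. The sketch below recomputes those sums as a quick cross-check when editing a config. The widths are read off the configs themselves (768 for BERT-base, 1024 for the ELMo model, 128 character-CNN filters with one n-gram size, 300 for fastText). The dictionary and function names are hypothetical.

```python
# Embedder output widths as they appear in the configs (assumption: the
# downloaded ELMo model really outputs 1024-dimensional vectors).
WIDTHS = {'bert': 768, 'elmo': 1024, 'token_characters': 128, 'tokens': 300}

# Active (not commented-out) embedders per config file.
CONFIG_EMBEDDERS = {
    'bertM': ['bert', 'token_characters'],
    'bertM_crf': ['bert', 'token_characters'],
    'elmo': ['elmo', 'token_characters'],
    'elmo_ft': ['tokens', 'elmo', 'token_characters'],
    'stacked': ['elmo', 'bert', 'token_characters'],
    'rubert': ['bert', 'token_characters'],
}


def encoder_input_size(embedders):
    """Width of the concatenated embeddings fed to the LSTM encoder."""
    return sum(WIDTHS[name] for name in embedders)


for config_name, embedders in CONFIG_EMBEDDERS.items():
    print(config_name, encoder_input_size(embedders))
```

If an embedder is enabled or disabled in a config, the `input_size` of its `encoder` block has to change accordingly.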

In [None]:
%%writefile ../tony/code/contextual_embeddings/expes.sh
# usage
#  sh expes.sh dataset config model

echo "data=$1, config=$2, model=$3"
   
export DATASET=${1}
# eg "eng.rst.gum"

export CONFIG=${2}
# options: conll tok split.tok wend.tok
#
export MODEL=${3}
#options: bert elmo bertM

if [ "$MODEL" = "bertM" ] || [ "$MODEL" = "bertM_crf" ]; 
then 
    export BERT_VOCAB="bert-base-multilingual-cased"
    export BERT_WEIGHTS="bert-base-multilingual-cased"
else
    # russian models
    #export BERT_VOCAB="http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz"
    #export BERT_WEIGHTS="http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz"
    export BERT_VOCAB="rubert_cased_L-12_H-768_A-12_pt"
    export BERT_WEIGHTS="rubert_cased_L-12_H-768_A-12_pt"
fi

# train and evaluate on dev set
export EVAL=dev
export GOLD_BASE="../../data/"
export CONV="data_converted/"
export TRAIN_DATA_PATH=${CONV}${DATASET}"_train.ner."${CONFIG}
export TEST_A_PATH=${CONV}${DATASET}"_"${EVAL}".ner."${CONFIG}
export OUTPUT=${DATASET}"_"${MODEL}
export GOLD=${GOLD_BASE}${DATASET}"/"${DATASET}"_"${EVAL}"."${CONFIG}

if [ ! -f ${CONV}${DATASET}"_train.ner."${CONFIG} ]; then
    echo "converting to ner format -> in data_converted ..."
    python conv2ner.py "../../data/"${DATASET}"/"${DATASET}"_train."${CONFIG} > ${CONV}/${DATASET}"_train.ner."${CONFIG}
    python conv2ner.py "../../data/"${DATASET}"/"${DATASET}"_test."${CONFIG} > ${CONV}/${DATASET}"_test.ner."${CONFIG}
    python conv2ner.py "../../data/"${DATASET}"/"${DATASET}"_dev."${CONFIG} > ${CONV}/${DATASET}"_dev.ner."${CONFIG}
fi

#python conv2ner.py "../../data/"${DATASET}"/"${DATASET}"_"${EVAL}"."${CONFIG} > ${CONV}/${DATASET}"_"${EVAL}".ner."${CONFIG}
# train with the selected config (configs/${MODEL}.jsonnet); the config explicitly references the variables TRAIN_DATA_PATH and TEST_A_PATH
allennlp train -s Results_${CONFIG}/results_${OUTPUT} configs/${MODEL}.jsonnet

# predict on the dev set with the trained model -> outputs json
allennlp predict --use-dataset-reader --silent --output-file Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.json Results_${CONFIG}/results_${OUTPUT}/model.tar.gz ${TEST_A_PATH}
# convert to disrpt format 
python json2conll.py Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.json ${CONFIG} > Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.${CONFIG}
# eval with disrpt script
python ../utils/seg_eval.py $GOLD Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.${CONFIG} >> Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.scores

export EVAL=test
export TEST_A_PATH=${CONV}${DATASET}"_"${EVAL}".ner."${CONFIG}
export OUTPUT=${DATASET}"_"${MODEL}
export GOLD=${GOLD_BASE}${DATASET}"/"${DATASET}"_"${EVAL}"."${CONFIG}
allennlp predict --use-dataset-reader --silent --output-file Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.json Results_${CONFIG}/results_${OUTPUT}/model.tar.gz ${TEST_A_PATH}
#convert to disrpt format 
python json2conll.py Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.json ${CONFIG} > Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.${CONFIG}
# eval with disrpt script
python ../utils/seg_eval.py $GOLD Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.predictions.${CONFIG} >> Results_${CONFIG}/results_${OUTPUT}/${DATASET}_${EVAL}.scores
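
To see which files `expes.sh` reads and writes for a given run, the path construction can be mirrored in Python. This is only an illustration of the string concatenation in the script, with a hypothetical function name. It does not run any training.

```python
def experiment_paths(dataset, config, model, eval_split='dev'):
    """Mirror the path variables built in expes.sh for one evaluation split."""
    conv = 'data_converted/'
    gold_base = '../../data/'
    output = f'{dataset}_{model}'
    results = f'Results_{config}/results_{output}'
    return {
        'TRAIN_DATA_PATH': f'{conv}{dataset}_train.ner.{config}',
        'TEST_A_PATH': f'{conv}{dataset}_{eval_split}.ner.{config}',
        'GOLD': f'{gold_base}{dataset}/{dataset}_{eval_split}.{config}',
        'predictions': f'{results}/{dataset}_{eval_split}.predictions.{config}',
        'scores': f'{results}/{dataset}_{eval_split}.scores',
    }


paths = experiment_paths('rus.rst.rrt', 'conll', 'elmo')
for name, path in paths.items():
    print(f'{name}: {path}')
```

All paths are relative to the directory the script is run from, so the script is expected to be started inside ``tony/code/contextual_embeddings``.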

### Train models 

In [None]:
%%writefile ../tony/code/contextual_embeddings/train_tony.sh

sh expes.sh rus.rst.rrt conll bertM
sh expes.sh rus.rst.rrt conll bertM_crf
#sh expes.sh rus.rst.rrt conll rubert
sh expes.sh rus.rst.rrt conll elmo
sh expes.sh rus.rst.rrt conll elmo_ft
#sh expes.sh rus.rst.rrt conll stacked

Go to the ``tony/code/contextual_embeddings`` directory and run ``train_tony.sh``.

The trained models and their evaluation scores appear in ``tony/code/contextual_embeddings/Results_conll``.