# Sequence-to-Sequence Applications

## Seq2seq recap:
- Sequence of words or tokens in, sequence of predictions out
    - MT: Sequence of English words in, sequence of Finnish words out
    - Compare to:
        - Sequence classification: Sequence of items in, one prediction out
        - Sequence Tagging/Labeling: Sequence of items in, one prediction per item out
    
- Seq2seq: Input and output do not need to be same length
  - Compare to labeling, where there is always exactly one output label per input item
- **Neural Attention**
    - Instead of a fixed-length vector representing encoded input, decoder has access to any part of the encoder state
    - A separate context vector (c_i) is computed for each decoder state: attention over the whole input when making that particular prediction
    - Decoder steps can “pay attention” to different parts of the input
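
To make the context vector concrete, here is a minimal sketch in plain Python. It assumes dot-product scoring between the decoder state and each encoder state (the notes do not say which scoring function is used), and the names `attention_context`, `encoder_states` and `decoder_state` are illustrative only.

```python
import math

def softmax(scores):
    # subtract the maximum for numerical stability
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    # score every encoder state against the current decoder state
    # (dot product is an assumption made for this sketch)
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    # context vector = attention-weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one toy vector per input item
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print("attention weights:", weights)
print("context vector:", context)
```

A different decoder state would give different weights, which is what lets each decoder step focus on a different part of the input.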

## Today's content (part 1):
- How to train complicated seq2seq models with very little coding effort
- Lemmatization as a seq2seq task
- Other seq2seq applications

## MT Frameworks/Libraries
- MT is one of the most widely studied seq2seq problems
- Many ready-made and well maintained libraries exist for Neural MT, e.g.
    - OpenNMT
    - MarianNMT
- Although developed mainly for NMT, these libraries hard-code nothing MT-specific --> they can be used for any seq2seq task

## Why use ready-made libraries?
- Everything already implemented (different attention models, different encoder/decoder architectures)
- Top notch results/models very difficult to replicate (true even when code is open-source)

## Date Normalization with OpenNMT

### Download and read the dataset

In [1]:
!pip install OpenNMT-py # restart runtime and run cell again if errors occur; the onmt_preprocess command used below belongs to the OpenNMT-py 1.x command-line interface



In [2]:
!wget -nc https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/generated_dates.txt

--2020-04-21 07:36:10--  https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/generated_dates.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2872063 (2.7M) [text/plain]
Saving to: ‘generated_dates.txt’


2020-04-21 07:36:11 (15.3 MB/s) - ‘generated_dates.txt’ saved [2872063/2872063]



In [3]:
def load_data(fname):
    data = []
    with open(fname, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            input_text, output_text = line.split("\t")
            data.append((input_text, output_text))
    return data

data = load_data("generated_dates.txt")

print("Number of examples:", len(data))
print("First examples:", data[:5])


Number of examples: 100000
First examples: [('tammikuun 18. 1987', '18.01.1987'), ('joulukuun 26. 1993', '26.12.1993'), ('KESÄKUUN 16. 2009', '16.06.2009'), ('1997/8/7', '07.08.1997'), ('9. päivänä Heinäkuuta 1981', '09.07.1981')]


### Create training and development files

- OpenNMT
    - command-line tool (only input and output files are needed), which can also be used as a Python library (but then you need to handle all the steps yourself)
    - requires text files as input
- Split data into train and development sets
- Data format:
    - One file for training input sequences, one file for training output sequences, line numbering must match
    - Items (words/characters/subwords) separated by whitespace


In [4]:
from sklearn.model_selection import train_test_split

def prepare_input(data):
  examples = []
  for input_, output_ in data:
    input_seq = " ".join(c for c in input_.replace(" ", "@")) # whitespace is item separator, so real whitespace must be represented differently
    output_seq = " ".join(c for c in output_.replace(" ", "@"))
    examples.append((input_seq, output_seq))
  return examples


def save_data(train_data, dev_data):
  # write data into files
  with open("train.input", "wt", encoding="utf-8") as train_input, open("train.output", "wt", encoding="utf-8") as train_output:
    for input_text, output_text in train_data:
      print(input_text, file=train_input)
      print(output_text, file=train_output)

  with open("dev.input", "wt", encoding="utf-8") as dev_input, open("dev.output", "wt", encoding="utf-8") as dev_output:
    for input_text, output_text in dev_data:
      print(input_text, file=dev_input)
      print(output_text, file=dev_output)



# prepare correct input representation (whitespace separated items) and save
data = prepare_input(data)
print("First examples:", data[:5])

train_data, dev_data = train_test_split(data, test_size=0.2, train_size=0.8, shuffle=True)
save_data(train_data, dev_data) # for OpenNMT



First examples: [('t a m m i k u u n @ 1 8 . @ 1 9 8 7', '1 8 . 0 1 . 1 9 8 7'), ('j o u l u k u u n @ 2 6 . @ 1 9 9 3', '2 6 . 1 2 . 1 9 9 3'), ('K E S Ä K U U N @ 1 6 . @ 2 0 0 9', '1 6 . 0 6 . 2 0 0 9'), ('1 9 9 7 / 8 / 7', '0 7 . 0 8 . 1 9 9 7'), ('9 . @ p ä i v ä n ä @ H e i n ä k u u t a @ 1 9 8 1', '0 9 . 0 7 . 1 9 8 1')]


## Train

In [5]:
# run OpenNMT preprocessing, this handles vectorization etc.
# a leading ! runs the line as a shell command; the onmt_* commands were installed by pip
# OpenNMT-py uses PyTorch as its backend
!onmt_preprocess -train_src train.input -train_tgt train.output -valid_src dev.input -valid_tgt dev.output -save_data preprocessed-data -src_words_min_frequency 5 -tgt_words_min_frequency 5 -overwrite

# train model
BATCH_SIZE = 64
LEARNING_RATE = 0.001
OPTIMIZER = "adam"
TRAIN_STEPS = 3000 #step is one minibatch, so train for 3000 minibatches

print("How many epochs:", (TRAIN_STEPS*BATCH_SIZE)/len(train_data))

# save the model
!onmt_train -data preprocessed-data -save_model trained-model -gpu_ranks 0 -learning_rate {LEARNING_RATE} -batch_size {BATCH_SIZE} -optim {OPTIMIZER} -train_steps {TRAIN_STEPS} -save_checkpoint_steps {TRAIN_STEPS}

[2020-04-21 07:45:28,219 INFO] Extracting features...
[2020-04-21 07:45:28,220 INFO]  * number of source features: 0.
[2020-04-21 07:45:28,220 INFO]  * number of target features: 0.
[2020-04-21 07:45:28,220 INFO] Building `Fields` object...
[2020-04-21 07:45:28,220 INFO] Building & saving training data...
[2020-04-21 07:45:28,302 INFO] Building shard 0.
[2020-04-21 07:45:30,940 INFO]  * saving 0th train data shard to preprocessed-data.train.0.pt.
[2020-04-21 07:45:32,839 INFO]  * tgt vocab size: 15.
[2020-04-21 07:45:32,839 INFO]  * src vocab size: 49.
[2020-04-21 07:45:32,870 INFO] Building & saving validation data...
[2020-04-21 07:45:32,914 INFO] Building shard 0.
[2020-04-21 07:45:33,262 INFO]  * saving 0th valid data shard to preprocessed-data.valid.0.pt.
How many epochs: 2.4
[2020-04-21 07:45:37,183 INFO]  * src vocab size = 49
[2020-04-21 07:45:37,183 INFO]  * tgt vocab size = 15
[2020-04-21 07:45:37,183 INFO] Building model...
[2020-04-21 07:45:45,299 INFO] NMTModel(
  (encoder
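
The relation between training steps and epochs printed above is simple arithmetic, sketched below. The helper names `epochs_for_steps` and `steps_for_epochs` are hypothetical, and the training set size of 80000 follows from the 80/20 split of the 100000 date examples.

```python
import math

def epochs_for_steps(train_steps, batch_size, num_examples):
    # each step consumes one minibatch of batch_size examples
    return train_steps * batch_size / num_examples

def steps_for_epochs(epochs, batch_size, num_examples):
    # round up so that the final partial minibatch is still covered
    return math.ceil(epochs * num_examples / batch_size)

num_train = 80000  # 80% of the 100000 generated date examples
print(epochs_for_steps(3000, 64, num_train))  # 2.4
print(steps_for_epochs(5, 64, num_train))     # 6250
```

Going the other way is handy for choosing the `-train_steps` value when you would rather think in epochs.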

## Predict

- Prediction also requires an input file, so let's write one on the fly with bash


In [6]:
#'1. helmikuuta 2003', 'Toukokuun 7. päivä 1995', '9. päivä huhtikuuta 2020'
!echo "1. helmikuuta 2003" | perl -pe 's/ /@/g' | perl -CS -pe 's/(.)/\1 /g' | tee "tmp.tmp"; onmt_translate -model trained-model_step_3000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'
!echo "Toukokuun 7. päivä 1995" | perl -pe 's/ /@/g' | perl -CS -pe 's/(.)/\1 /g' | tee "tmp.tmp"; onmt_translate -model trained-model_step_3000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'
!echo "9. päivä huhtikuuta 2020" | perl -pe 's/ /@/g' | perl -CS -pe 's/(.)/\1 /g' | tee "tmp.tmp"; onmt_translate -model trained-model_step_3000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'

1 . @ h e l m i k u u t a @ 2 0 0 3 
[2020-04-21 07:54:02,059 INFO] Translating shard 0.
PRED AVG SCORE: -0.0000, PRED PPL: 1.0000
01.02.2003
T o u k o k u u n @ 7 . @ p ä i v ä @ 1 9 9 5 
[2020-04-21 07:54:05,574 INFO] Translating shard 0.
PRED AVG SCORE: -0.0000, PRED PPL: 1.0000
07.05.1995
9 . @ p ä i v ä @ h u h t i k u u t a @ 2 0 2 0 
[2020-04-21 07:54:09,008 INFO] Translating shard 0.
PRED AVG SCORE: -0.0000, PRED PPL: 1.0000
09.04.2020
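
The Perl one-liners above only convert between plain text and the whitespace-separated character format. As a sketch, the same conversion can be written in Python. The function names `encode_chars` and `decode_chars` are illustrative, and the code assumes, as the data preparation does, that `@` never occurs in the real text.

```python
def encode_chars(text):
    # real spaces become "@", then every character becomes its own item
    return " ".join(text.replace(" ", "@"))

def decode_chars(sequence):
    # inverse: drop the item separators, then restore the real spaces
    return sequence.replace(" ", "").replace("@", " ")

encoded = encode_chars("1. helmikuuta 2003")
print(encoded)                               # 1 . @ h e l m i k u u t a @ 2 0 0 3
print(decode_chars(encoded))                 # 1. helmikuuta 2003
print(decode_chars("0 1 . 0 2 . 2 0 0 3"))   # 01.02.2003
```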


# Recap: Running OpenNMT-py
- You need to create input sequence and output sequence text files
    - Line numbers must match
    - Items (characters/tokens) separated using whitespace
- Model type can be defined using command-line parameters
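
Since mismatched line counts silently pair inputs with the wrong outputs, a quick sanity check before preprocessing can save time. The sketch below is not part of OpenNMT; `check_parallel_files` is a hypothetical helper, and the demo files are written to a temporary directory.

```python
import os
import tempfile

def read_lines(path):
    with open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def check_parallel_files(src_path, tgt_path):
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    # line numbers must match between the input and output files
    if len(src_lines) != len(tgt_lines):
        raise ValueError(f"{len(src_lines)} source lines but {len(tgt_lines)} target lines")
    # every line must contain at least one whitespace-separated item
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        if not src.split() or not tgt.split():
            raise ValueError(f"empty sequence on line {number}")
    return len(src_lines)

# small demo with two aligned examples
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "demo.input")
    tgt_path = os.path.join(tmp_dir, "demo.output")
    with open(src_path, "wt", encoding="utf-8") as f:
        f.write("d o g s\np l a y e d\n")
    with open(tgt_path, "wt", encoding="utf-8") as f:
        f.write("d o g\np l a y\n")
    num_pairs = check_parallel_files(src_path, tgt_path)
print("aligned pairs:", num_pairs)
```

In the notebook, the same check could be run on `train.input` / `train.output` and `dev.input` / `dev.output`.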

# Lemmatization

- For the given word (which may be inflected), determine its base form (dictionary form)
    - dogs --> dog
    - played --> play
    - talossa --> talo
    - lukisimme --> lukea
    - öiden --> yö

- In some languages, words can change heavily when inflected (it's not just adding suffixes)
- Excluding irregular words, inflections are not arbitrary; they follow the rules of the language
    - Rules can be very complicated, but learnable
- Good fit for sequence to sequence models
- Simple approach: inflected word in --> lemma out
    - character level model where one character is one input unit
    - d o g s --> d o g 
    - l u k i s i m m e --> l u k e a

- Ambiguity:
    - lives --> live (VERB) or life (NOUN)
    - koirasta --> koira (Case=Ela) or koiras (Case=Par) or koi#rasta (Case=Nom, if such exists?)
    - We need context representation!
    - "En pidä hänen koirasta, koska se haukkuu liikaa." --> koira

- Context representations
  - Context as sliding window of text
      - p i d ä @ h ä n e n @ < k o i r a s t a > , @ k o s k a --> k o i r a
      - Context representation is very sparse
      - At the same time you need to learn how inflections work in Finnish, and to understand the context in order to generate the correct lemma
      - Works reasonably well if you have **a lot** of training data
  - **Context as morphological tags**
      - l i v e s VERB Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin --> l i v e
      - l i v e s NOUN Number=Plur --> l i f e 
      - k o i r a s t a NOUN Case=Ela Number=Sing --> k o i r a
      - k o i r a s t a NOUN Case=Par Number=Sing --> k o i r a s
      - Compact context representation (better generalization)
      - If you already know these morphological tags (by running a tagger), you only need to learn how inflections work
      - Works very well also with less training data

- **Brain-teaser:** Why are whole words not suitable input and output units for lemmatization?
    - Lukisimme koko päivän kirjaa . --> lukea koko päivä kirja . (sequence labeling task, label set = lemma vocabulary)
    - You would need to learn a mapping between each word and its possible lemmas
      - if you see 'koirasta' in input, remember that possible lemmas are 'koira' and 'koiras', and predict one of these 'labels' based on context
    - Vocabulary is huge, and data is sparse
    - You would not be able to predict anything for unknown words (words not seen during training)
      - In practice, the model would pick a (random) lemma out of all lemmas, or predict an 'unknown' label if trained to do so
    - "Herra Růžičkalla on uudenaikainen moottorivene ." --> "herra UNK olla UNK UNK ."
    - However, you can transform this into a reasonable sequence labeling task by predicting e.g. edit rules, but that is another story...
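
The unknown-word problem from the brain-teaser can be illustrated with a toy word-level lemmatizer. This is a plain lookup table standing in for a word-level sequence labeler, not a neural model; the tiny training list and the names `train_lookup` and `lemmatize` are made up for the illustration.

```python
from collections import Counter, defaultdict

def train_lookup(pairs):
    # count which lemmas were seen for each (lowercased) word form
    counts = defaultdict(Counter)
    for word, lemma in pairs:
        counts[word.lower()][lemma] += 1
    # without context, the best we can do is the most frequent lemma per word
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def lemmatize(tokens, table):
    # any word not seen in training can only be labeled UNK
    return [table.get(token.lower(), "UNK") for token in tokens]

training_pairs = [
    ("Herra", "herra"), ("on", "olla"), (".", "."),
    ("koirasta", "koira"), ("koirasta", "koira"), ("koirasta", "koiras"),
]
table = train_lookup(training_pairs)
tokens = "Herra Růžičkalla on uudenaikainen moottorivene .".split()
print(lemmatize(tokens, table))
# ['herra', 'UNK', 'olla', 'UNK', 'UNK', '.']
```

A character-level seq2seq model avoids this because it generates the lemma character by character instead of choosing from a closed set of labels.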


## Let's train a Finnish seq2seq lemmatizer with morphological tags as context

- When we have the lemmatizer model trained, we can lemmatize new text by first running a tagger model to predict morphological tags, then running the lemmatizer for each word

## Download data

- **treebank:** an annotated collection of text including annotation for text segmentation, part-of-speech and morphological features, lemmatization and syntactic relations
- Finnish: https://github.com/UniversalDependencies/UD_Finnish-TDT
- Other languages: https://universaldependencies.org




In [7]:
!wget https://raw.githubusercontent.com/UniversalDependencies/UD_Finnish-TDT/master/fi_tdt-ud-train.conllu
!wget https://raw.githubusercontent.com/UniversalDependencies/UD_Finnish-TDT/master/fi_tdt-ud-dev.conllu

--2020-04-21 08:20:27--  https://raw.githubusercontent.com/UniversalDependencies/UD_Finnish-TDT/master/fi_tdt-ud-train.conllu
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13443822 (13M) [text/plain]
Saving to: ‘fi_tdt-ud-train.conllu’


2020-04-21 08:20:28 (22.5 MB/s) - ‘fi_tdt-ud-train.conllu’ saved [13443822/13443822]

--2020-04-21 08:20:30--  https://raw.githubusercontent.com/UniversalDependencies/UD_Finnish-TDT/master/fi_tdt-ud-dev.conllu
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1511949 (1.4M) [text/plain]
Saving to: ‘fi_tdt-u

## Data preprocessing

What we need to do is:
- Extract words, lemmas and morphological features
- Create an **input sequence file** having the words (one character is one input item) and morphological features (one tag is one input item), one word per line
- Create an **output sequence file** having the lemma (one character is one output item), one lemma per line
- Line numbering must match

**Task setting:** Given an input sequence, predict the corresponding output sequence

In [0]:
# helper functions

"""
# sent_id = b101.1
# text = Kävelyreitti III
1	Kävelyreitti	kävely#reitti	NOUN	N	Case=Nom|Number=Sing	0	root	0:root	_
2	III	III	ADJ	Num	NumType=Ord	1	amod	1:amod	_

# sent_id = b101.2
# text = Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin.
1	Jäällä	jää	NOUN	N	Case=Ade|Number=Sing	2	nmod	2:nmod	_
2	kävely	kävely	NOUN	N	Case=Nom|Derivation=U|Number=Sing	3	nsubj	3:nsubj	_
3	avaa	avata	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
4	aina	aina	ADV	Adv	_	3	advmod	3:advmod	_
5	hauskoja	hauska	ADJ	A	Case=Par|Degree=Pos|Number=Plur	8	amod	8:amod	_
6	ja	ja	CCONJ	C	_	7	cc	7:cc	_
7	erikoisia	erikoinen	ADJ	A	Case=Par|Degree=Pos|Derivation=Inen|Number=Plur	5	conj	5:conj|8:amod	_
8	näkökulmia	näkö#kulma	NOUN	N	Case=Par|Number=Plur	3	obj	3:obj	_
9	kaupunkiin	kaupunki	NOUN	N	Case=Ill|Number=Sing	8	nmod	8:nmod	SpaceAfter=No
10	.	.	PUNCT	Punct	_	3	punct	3:punct	_

"""

ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)

def read_conllu(f):
    sent=[]
    comment=[]
    for line in f:
        line=line.strip()
        if not line: # new sentence
            if sent:
                yield comment,sent
            comment=[]
            sent=[]
        elif line.startswith("#"):
            comment.append(line)
        else: #normal line
            sent.append(line.split("\t"))
    else:
        if sent:
            yield comment, sent


def prepare_lemmatization_dataset(fname):
  # create input and output sequences (in text format)
  data = []
  with open(fname, "rt", encoding="utf-8") as f:
    for comments, sent in read_conllu(f):
      for token in sent:
        if "-" in token[ID] or "." in token[ID]: # multiword token or null node --> skip
          continue
        input_chars = " ".join(c for c in token[FORM]) # our translation model uses whitespace tokenization, so let's create character level model by inserting whitespaces
        features = " ".join([token[UPOS]] + token[FEATS].split("|")) # add morphological features
        input_chars = input_chars + " " + features
        lemma_chars = " ".join(c for c in token[LEMMA])
        data.append((input_chars, lemma_chars))
  return data



In [9]:
import random 

# transform data
train_data = prepare_lemmatization_dataset("fi_tdt-ud-train.conllu")
random.shuffle(train_data)

dev_data = prepare_lemmatization_dataset("fi_tdt-ud-dev.conllu")

print("First train examples:", train_data[:5])
print("Number of training examples:", len(train_data))
print("Number of unique training examples:", len(set(train_data)))
print("\nNumber of development examples:", len(dev_data))
print("Number of unique development examples:", len(set(dev_data)))

save_data(train_data, dev_data)

First train examples: [('a l a s t i ADV Derivation=Sti', 'a l a s t i'), ('j u o t t a a VERB Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin Voice=Act', 'j u o t t a a'), ('M a i t o l a m m i k k o o n NOUN Case=Ill Number=Sing', 'm a i t o # l a m m i k k o'), (', PUNCT _', ','), ('t o r j u a VERB InfForm=1 Number=Sing VerbForm=Inf Voice=Act', 't o r j u a')]
Number of training examples: 162816
Number of unique training examples: 51100

Number of development examples: 18308
Number of unique development examples: 8662


In [10]:
# run preprocessing, this handles vectorization etc.
!onmt_preprocess -train_src train.input -train_tgt train.output -valid_src dev.input -valid_tgt dev.output -save_data preprocessed-data -src_words_min_frequency 5 -tgt_words_min_frequency 5 -overwrite
# preprocess: text to numbers

# train model

BATCH_SIZE = 64
LEARNING_RATE = 0.001
OPTIMIZER = "adam"
ENCODER = "brnn" # bidirectional rnn
TRAIN_STEPS = 7000 #step is one minibatch, so train for 7000 minibatches

print("How many epochs:", (TRAIN_STEPS*BATCH_SIZE)/len(train_data))

!onmt_train -data preprocessed-data -save_model trained-model -gpu_ranks 0 -learning_rate {LEARNING_RATE} -batch_size {BATCH_SIZE} -optim {OPTIMIZER} -train_steps {TRAIN_STEPS} -save_checkpoint_steps {TRAIN_STEPS} -encoder_type {ENCODER}

[2020-04-21 08:25:50,799 INFO] Extracting features...
[2020-04-21 08:25:50,800 INFO]  * number of source features: 0.
[2020-04-21 08:25:50,800 INFO]  * number of target features: 0.
[2020-04-21 08:25:50,800 INFO] Building `Fields` object...
[2020-04-21 08:25:50,800 INFO] Building & saving training data...
[2020-04-21 08:25:50,976 INFO] Building shard 0.
[2020-04-21 08:25:56,754 INFO]  * saving 0th train data shard to preprocessed-data.train.0.pt.
[2020-04-21 08:26:00,047 INFO]  * tgt vocab size: 115.
[2020-04-21 08:26:00,047 INFO]  * src vocab size: 217.
[2020-04-21 08:26:00,051 INFO] Building & saving validation data...
[2020-04-21 08:26:00,085 INFO] Building shard 0.
[2020-04-21 08:26:00,435 INFO]  * saving 0th valid data shard to preprocessed-data.valid.0.pt.
How many epochs: 2.751572327044025
[2020-04-21 08:26:03,823 INFO]  * src vocab size = 217
[2020-04-21 08:26:03,823 INFO]  * tgt vocab size = 115
[2020-04-21 08:26:03,823 INFO] Building model...
[2020-04-21 08:26:06,205 INFO] NM

![Lemmatizer model](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/raw/master/figs/lemmatizer-model.png)

In [0]:
# predict 
!cat dev.input | head -20 > small_test.input ; onmt_translate -model trained-model_step_7000.pt -src small_test.input -output pred.txt -replace_unk ; paste -d"\t" small_test.input pred.txt

[2020-04-20 21:15:38,406 INFO] Translating shard 0.
PRED AVG SCORE: -0.0221, PRED PPL: 1.0223
T h e PROPN Case=Nom Number=Sing	T h e
G a r d e n PROPN Case=Nom Number=Sing	G a r d e n
C o l l e c t i o n PROPN Case=Nom Number=Sing	C o l l e c t i o n
b y PROPN Case=Nom Number=Sing	b y
H & M PROPN Abbr=Yes Case=Nom Number=Sing	H & M
V i i k o n l o p u n NOUN Case=Gen Derivation=U Number=Sing	v i i k o n # l o p p u
p y ö r i t y s NOUN Case=Nom Number=Sing	p y ö r i t y s
a l k o i VERB Mood=Ind Number=Sing Person=3 Tense=Past VerbForm=Fin Voice=Act	a l k a a
H & M : n PROPN Abbr=Yes Case=Gen Number=Sing	H & M
j ä r j e s t ä m ä l l ä VERB Case=Ade Degree=Pos Number=Sing PartForm=Agt VerbForm=Part Voice=Act	j ä r j e s t ä ä
b l o g g a a j a b r u n s s i l l a NOUN Case=Ade Number=Sing	b l o g g a a j a # b r u n s s i
H e l s i n g i s s ä PROPN Case=Ine Number=Sing	H e l s i n k i
. PUNCT _	.
S h o w r o o m i l l a PROPN Case=Ade Number=Sing	S h o w r o o m i
e s i t e l t i i n 

In [0]:
# ambiguous words
!echo "k o i r a s t a NOUN Case=Ela Number=Sing" > tmp.tmp ; onmt_translate -model trained-model_step_7000.pt -src tmp.tmp -output pred.txt -replace_unk ; cat pred.txt | perl -pe 's/ //g'
!echo "k o i r a s t a NOUN Case=Par Number=Sing" > tmp.tmp ; onmt_translate -model trained-model_step_7000.pt -src tmp.tmp -output pred.txt -replace_unk ; cat pred.txt | perl -pe 's/ //g'

[2020-04-20 21:15:42,854 INFO] Translating shard 0.
PRED AVG SCORE: -0.0007, PRED PPL: 1.0007
koira
[2020-04-20 21:15:47,637 INFO] Translating shard 0.
PRED AVG SCORE: -0.0073, PRED PPL: 1.0073
koiras
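
Typing the input lines by hand is error-prone, so here is a small sketch of helpers that build a lemmatizer input line from a word and its tags, and turn a predicted character sequence back into a lemma. `format_lemmatizer_input` and `decode_lemma` are hypothetical names; the format mirrors what `prepare_lemmatization_dataset` produces, and `#` is the compound boundary marker used in the treebank lemmas.

```python
def format_lemmatizer_input(word, upos, feats):
    # feats is in CoNLL-U notation, e.g. "Case=Ela|Number=Sing", or "_" when empty
    return " ".join(list(word) + [upos] + feats.split("|"))

def decode_lemma(prediction, keep_compound_boundary=True):
    # join the predicted characters back into one string
    lemma = prediction.replace(" ", "")
    if not keep_compound_boundary:
        # drop the "#" compound boundary marker (assumes "#" is not itself the lemma)
        lemma = lemma.replace("#", "")
    return lemma

print(format_lemmatizer_input("koirasta", "NOUN", "Case=Ela|Number=Sing"))
# k o i r a s t a NOUN Case=Ela Number=Sing
print(decode_lemma("v i i k o n # l o p p u"))
# viikon#loppu
print(decode_lemma("v i i k o n # l o p p u", keep_compound_boundary=False))
# viikonloppu
```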


## Fully trained models available at https://turkunlp.org/Turku-neural-parser-pipeline/
  - Includes trained models for segmentation, part-of-speech and morphological tagging, lemmatization, and syntactic parsing
  - Over 50 languages supported
  - Finnish lemmatization accuracy: ~95%
  - Demo: http://bionlp-www.utu.fi/parser_demo/
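
For the small model trained in this notebook, accuracy can be estimated by comparing predictions against the gold lemmas line by line. The sketch below assumes exact-match accuracy over tokens; how the ~95% figure above was measured is not stated here, so the two numbers are not necessarily comparable. The name `lemma_accuracy` and the toy lists are illustrative.

```python
def lemma_accuracy(predicted, gold):
    # predictions and gold lemmas must be aligned line by line
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of lines")
    if not gold:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return correct / len(gold)

predicted = ["k o i r a", "t a l o", "l u k i a", "y ö"]
gold = ["k o i r a", "t a l o", "l u k e a", "y ö"]
print(lemma_accuracy(predicted, gold))  # 0.75
```

In the notebook, `predicted` would be the lines of `pred.txt` after translating all of `dev.input`, and `gold` the lines of `dev.output`.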

# Word inflection model
- The model can also be trained 'the other way around'
    - Generate the inflected word from the given lemma and desired inflection (for example Case information)
    - (Very handy for a language learner! 😉)

- We just need to modify the data preparation code
  - Input: lemma + morphological features
  - Output: inflected word

- Used in search engines, document classification



In [11]:
def prepare_inflection_dataset(fname):
  # create input and output sequences (in text format)
  data = []
  with open(fname, "rt", encoding="utf-8") as f:
    for comments, sent in read_conllu(f):
      for token in sent:
        if "-" in token[ID] or "." in token[ID]: # multiword token or null node --> skip
          continue
        if token[UPOS] != "NOUN": # let's take only nouns to speed up training
          continue
        input_chars = " ".join(c for c in token[LEMMA]) # input is lemma + morphological features
        features = " ".join([token[UPOS]] + token[FEATS].split("|")) # add morphological features
        input_chars = input_chars + " " + features
        output_chars = " ".join(c for c in token[FORM])
        data.append((input_chars, output_chars))
  return data


# transform data
train_data = prepare_inflection_dataset("fi_tdt-ud-train.conllu")
import random
random.shuffle(train_data)

dev_data = prepare_inflection_dataset("fi_tdt-ud-dev.conllu")
print("First train examples:", train_data[:5])
print("Number of training examples:", len(train_data))
print("\nNumber of development examples:", len(dev_data))

save_data(train_data, dev_data)

First train examples: [('h a k i j a # m ä ä r ä NOUN Case=Ela Number=Sing', 'h a k i j a m ä ä r ä s t ä'), ('l o s o # p e r s e NOUN Case=Nom Number=Sing', 'l o s o p e r s e'), ('t o i m e e n # t u l o NOUN Case=Nom Number=Sing', 't o i m e e n t u l o'), ('i n n o s t u s NOUN Case=Nom Number=Sing', 'i n n o s t u s'), ('p i k k u # l i n t u NOUN Case=Nom Number=Sing', 'p i k k u l i n t u')]
Number of training examples: 45588

Number of development examples: 5103


In [12]:
# run preprocessing, this handles vectorization etc.
!onmt_preprocess -train_src train.input -train_tgt train.output -valid_src dev.input -valid_tgt dev.output -save_data preprocessed-data -src_words_min_frequency 5 -tgt_words_min_frequency 5 -overwrite

# train model

BATCH_SIZE = 64
LEARNING_RATE = 0.001
OPTIMIZER = "adam"
ENCODER = "brnn"
TRAIN_STEPS = 5000 #step is one minibatch, so train for 5000 minibatches

print("How many epochs:", (TRAIN_STEPS*BATCH_SIZE)/len(train_data))

!onmt_train -data preprocessed-data -save_model trained-model -gpu_ranks 0 -learning_rate {LEARNING_RATE} -batch_size {BATCH_SIZE} -optim {OPTIMIZER} -train_steps {TRAIN_STEPS} -save_checkpoint_steps {TRAIN_STEPS} -encoder_type {ENCODER}

[2020-04-21 08:38:20,586 INFO] Extracting features...
[2020-04-21 08:38:20,586 INFO]  * number of source features: 0.
[2020-04-21 08:38:20,586 INFO]  * number of target features: 0.
[2020-04-21 08:38:20,586 INFO] Building `Fields` object...
[2020-04-21 08:38:20,586 INFO] Building & saving training data...
[2020-04-21 08:38:20,645 INFO] Building shard 0.
[2020-04-21 08:38:22,120 INFO]  * saving 0th train data shard to preprocessed-data.train.0.pt.
[2020-04-21 08:38:23,201 INFO]  * tgt vocab size: 76.
[2020-04-21 08:38:23,202 INFO]  * src vocab size: 112.
[2020-04-21 08:38:23,205 INFO] Building & saving validation data...
[2020-04-21 08:38:23,225 INFO] Building shard 0.
[2020-04-21 08:38:23,332 INFO]  * saving 0th valid data shard to preprocessed-data.valid.0.pt.
How many epochs: 7.019391067824866
[2020-04-21 08:38:26,478 INFO]  * src vocab size = 112
[2020-04-21 08:38:26,478 INFO]  * tgt vocab size = 76
[2020-04-21 08:38:26,479 INFO] Building model...
[2020-04-21 08:38:28,836 INFO] NMTM

## Predict Inflections

In [0]:
!echo "k i s s a NOUN Case=Ade Number=Sing" > tmp.tmp ; onmt_translate -model trained-model_step_5000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'
!echo "t a l o NOUN Case=Ine Number=Sing" > tmp.tmp ; onmt_translate -model trained-model_step_5000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'
!echo "t u l p p a a n i NOUN Case=Par Number=Plur" > tmp.tmp ; onmt_translate -model trained-model_step_5000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'
!echo "t u l p p a a n i NOUN Case=Par Clitic=Kin Number=Plur" > tmp.tmp ; onmt_translate -model trained-model_step_5000.pt -src tmp.tmp -output pred.txt ; cat pred.txt | perl -pe 's/ //g'

[2020-04-20 21:20:25,975 INFO] Translating shard 0.
PRED AVG SCORE: -0.0068, PRED PPL: 1.0069
kissalla
[2020-04-20 21:20:29,071 INFO] Translating shard 0.
PRED AVG SCORE: -0.0034, PRED PPL: 1.0034
talossa
[2020-04-20 21:20:33,170 INFO] Translating shard 0.
PRED AVG SCORE: -0.0124, PRED PPL: 1.0124
tulppaaneja
[2020-04-20 21:20:37,771 INFO] Translating shard 0.
PRED AVG SCORE: -0.0315, PRED PPL: 1.0320
tulppaanejakin
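
Each command above loads the model again for a single word. Since the `-src` file may contain many lines (as with `small_test.input` earlier), several requests can be written to one file and translated in a single run. A minimal sketch follows; `write_inflection_requests` is a hypothetical helper, and the demo writes to a temporary directory rather than to `tmp.tmp`.

```python
import os
import tempfile

def write_inflection_requests(requests, path):
    # one request per line: lemma characters, then UPOS, then the individual features
    lines = [" ".join(list(lemma) + [upos] + feats.split("|")) for lemma, upos, feats in requests]
    with open(path, "wt", encoding="utf-8") as f:
        for line in lines:
            print(line, file=f)
    return lines

requests = [
    ("kissa", "NOUN", "Case=Ade|Number=Sing"),
    ("talo", "NOUN", "Case=Ine|Number=Sing"),
    ("tulppaani", "NOUN", "Case=Par|Number=Plur"),
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "batch.input")
    lines = write_inflection_requests(requests, path)
    with open(path, "rt", encoding="utf-8") as f:
        written = f.read().splitlines()
print(lines)
```

Pointing the same `onmt_translate` command at such a file would then produce one prediction per input line in `pred.txt`.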
