<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_seq2seq_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The task

* Lemmatization
* Input: wordform + morpho information
* Output: word baseform
* Easy for English, but not so much for Finnish or many other languages

Here are a few examples:

* dogs+NOUN+Plural -> dog
* sheep+NOUN+Plural -> sheep
* voi+VERB+... -> voida
* voi+NOUN+Singular -> voi

# Data preparation

* We can use universaldependencies.org
* Collection of treebanks
* Pick your favorite language, I will use Finnish

In [1]:
!pip3 install --quiet datasets transformers

#You can use UD_English-EWT for English

!wget -O train.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-train.conllu
!wget -O validation.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-dev.conllu
!wget -O test.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-test.conllu


# Data preparation

* The CoNLL format should be familiar to you by now
* Here are a few lines from an English sentence (the columns are TAB-separated)



```
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	4	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	4	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	1	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	6	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	4	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	4	punct	4:punct	_


```



* Let us form training examples like so:
    * Input is `wordform`_`POS`_`FEATS`
    * Output is the lemma
* We can reuse part of our dataset preparation code from the [MLP notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_mlp.ipynb)
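
As a quick illustration of the format described above, here is a minimal sketch that builds one training example from token 4 of the sample sentence. The column constants mirror the ten CoNLL-U columns; the variable names are only for illustration.

```python
# Column indices of the ten CoNLL-U fields
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

# Token 4 of the sample sentence above, columns separated by TAB
line = "4\tMorphed\tmorph\tVERB\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t1\tadvcl\t1:advcl:if\t_"
cols = line.split("\t")

example = {
    "form_tags": "_".join([cols[FORM], cols[UPOS], cols[FEAT]]),  # model input
    "lemma": cols[LEMMA],                                         # model output
}
print(example)
# {'form_tags': 'Morphed_VERB_Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin', 'lemma': 'morph'}
```

Note that a token without morphological features has `_` in the FEATS column, so its input ends in a double underscore, e.g. `if_SCONJ__`.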

In [2]:
import json

ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10)

def yield_examples(fname):
    with open(fname, encoding="utf-8") as f: #treebank files are UTF-8
        for line in f:
            line=line.rstrip("\n")
            if not line or line.startswith("#"): #empty and comment lines: skip
                continue
            cols=line.split("\t")
            if not cols[0].isnumeric(): #lines which are not a real word: skip
                continue
            form_tags,lemma=cols[FORM]+"_"+cols[UPOS]+"_"+cols[FEAT],cols[LEMMA]
            #and here is the example
            yield {"form_tags":form_tags,"lemma":lemma}

#turn every .conllu into the corresponding .jsonl with the examples
for fname in ("train.conllu","validation.conllu","test.conllu"):
    with open(fname.replace(".conllu",".jsonl"),"wt") as f_out:
        for example in yield_examples(fname):
            print(json.dumps(example),file=f_out)
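
The `isnumeric()` test in `yield_examples` deserves a short illustration. In CoNLL-U the ID column is a plain integer for ordinary words, a range such as `1-2` for multiword tokens, and a decimal such as `8.1` for empty nodes. A small sketch with made-up ID values shows which of these pass the check:

```python
# Made-up ID values covering the three kinds of CoNLL-U lines
token_ids = ["7", "1-2", "8.1"]

# Only plain integers are numeric; "-" and "." make isnumeric() return False
kept = [token_id for token_id in token_ids if token_id.isnumeric()]
print(kept)  # ['7']
```

Only ordinary word lines survive, so every training example corresponds to exactly one word with its own lemma.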



In [17]:
# Load the dataset, this is easy since we have
# each section in a separate file
import datasets

dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"train":"train.jsonl","validation":"validation.jsonl","test":"test.jsonl"},
    split={
        "train":"train",
        "validation":"validation",
        "test":"test"
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "form_tags":datasets.Value("string"),
        "lemma":datasets.Value("string")
    })
)

dataset=dataset.shuffle()
#that was easy!





In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 162816
    })
    validation: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 18308
    })
    test: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 21070
    })
})

# Tokenize and prep

In [5]:
import transformers

#OK, let's try with our trusty tokenizer
#but why would this work in the first place?
model_name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)




The examples are formed much like the ones you have seen before:

* `input_ids` is the input side
* `attention_mask` is the input attention mask
* `labels` is the output ids
* the encoder-decoder model should take care of the rest
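
Part of "the rest" is building the decoder input: during training the decoder sees the labels shifted one position to the right, starting from a dedicated start token. Below is a minimal pure-Python sketch of that shift, assuming the usual Hugging Face seq2seq convention (start token prepended, last label dropped, `-100` label padding mapped to the pad id). The ids are taken from the batch printed later in this notebook; `shift_right` is a hypothetical helper written for illustration, not a library function.

```python
def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Prepend the start token, drop the last label, map -100 to the pad id."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [102, 10666, 101, 10205, 103]  # label ids from the batch shown later
decoder_input_ids = shift_right(labels, decoder_start_token_id=53, pad_token_id=0)
print(decoder_input_ids)  # [53, 102, 10666, 101, 10205]
```

At each position the decoder therefore predicts the next label given all the previous ones.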

In [7]:
def tokenize(example):
    inp_tok=tokenizer(example["form_tags"],truncation=True)
    outp_tok=tokenizer(example["lemma"],truncation=True)
    return {"input_ids":inp_tok["input_ids"],
            "attention_mask":inp_tok["attention_mask"],
            "labels":outp_tok["input_ids"]}

dataset=dataset.map(tokenize)


In [8]:
config_encoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         )
config_decoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         decoder_start_token_id=53) #same value as config.decoder_start_token_id below
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
config.decoder_start_token_id=53
config.pad_token_id=0
model = transformers.EncoderDecoderModel(config=config)



In [9]:
help(model.forward)

Help on method forward in module transformers.models.encoder_decoder.modeling_encoder_decoder:

forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, encoder_outputs=None, past_key_values=None, inputs_embeds=None, decoder_inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs) method of transformers.models.encoder_decoder.modeling_encoder_decoder.EncoderDecoderModel instance
    The [`EncoderDecoderModel`] forward method, overrides the `__call__` special method.
    
    <Tip>
    
    Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
    instance afterwards instead of this since the former takes care of running the pre and post processing steps while
    the latter silently ignores them.
    
    </Tip>
    
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            In

In [10]:
collator=transformers.DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                             model=model,
                                             padding=True,
                                             return_tensors="pt")

lst=[]
for e in dataset["train"]:
    lst.append({"input_ids":e["input_ids"],"labels":e["labels"],"attention_mask":e["attention_mask"]})
    break
batch=collator(lst)
batch

{'input_ids': tensor([[  102, 21513, 39957,  1064,  3888, 28215, 15949,  3888, 25594, 50010,
          2199,  1174, 50017,   101, 22839,  5664,   139,  2199, 15718,   103]]), 'labels': tensor([[  102, 10666,   101, 10205,   103]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'decoder_input_ids': tensor([[   53,   102, 10666,   101, 10205]])}
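
The batch above contains a single example, so no padding is visible. With several examples of different lengths, the collator has to bring them to a common length. Here is a rough pure-Python sketch of that step, assuming right-padding, pad id 0 for the inputs (matching `config.pad_token_id=0`), and `-100` for the labels, which is the default `label_pad_token_id` of `DataCollatorForSeq2Seq`. The helper `pad_batch` and the toy ids are made up for illustration.

```python
def pad_batch(examples, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad inputs, attention masks and labels to the longest in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_out = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in = max_in - len(e["input_ids"])
        n_out = max_out - len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_in)
        batch["labels"].append(e["labels"] + [label_pad_token_id] * n_out)
    return batch

# Two toy examples with made-up token ids and different lengths
toy = [
    {"input_ids": [102, 7, 8, 9, 103], "attention_mask": [1] * 5, "labels": [102, 7, 103]},
    {"input_ids": [102, 5, 103], "attention_mask": [1] * 3, "labels": [102, 5, 6, 103]},
]
padded = pad_batch(toy)
print(padded["input_ids"][1])       # [102, 5, 103, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(padded["labels"][0])          # [102, 7, 103, -100]
```

Label positions set to `-100` are ignored by the cross-entropy loss, so padding does not contribute to training.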

In [15]:
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    max_steps=20000,
    save_steps=1000
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [16]:
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer = tokenizer
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: lemma, form_tags. If lemma, form_tags are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 162816
  Num Epochs = 4
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 20000


| Step | Training Loss | Validation Loss |
|---|---|---|
| 1000 | 3.1188 | 3.038699 |
| 2000 | 2.7188 | 2.617391 |
| 3000 | 2.4839 | 2.453938 |
| 4000 | 2.3665 | 2.339415 |
| 5000 | 2.239 | 2.241295 |
| 6000 | 2.1584 | 2.164905 |
| 7000 | 2.0962 | 2.103438 |
| 8000 | 1.9549 | 2.050257 |
| 9000 | 1.9597 | 1.997536 |
| 10000 | 1.9801 | 1.957424 |


The following columns in the evaluation set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: lemma, form_tags. If lemma, form_tags are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 18308
  Batch size = 32
Saving model checkpoint to checkpoints/checkpoint-1000
Configuration saved in checkpoints/checkpoint-1000/config.json
Model weights saved in checkpoints/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in checkpoints/checkpoint-1000/tokenizer_config.json
Special tokens file saved in checkpoints/checkpoint-1000/special_tokens_map.json

TrainOutput(global_step=20000, training_loss=2.1528695945739744, metrics={'train_runtime': 2549.9506, 'train_samples_per_second': 250.985, 'train_steps_per_second': 7.843, 'total_flos': 1300375933911936.0, 'train_loss': 2.1528695945739744, 'epoch': 3.93})
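
The `Num Epochs = 4` and `'epoch': 3.93` figures in the log follow directly from the step budget. Here is a quick check that uses only numbers reported in this notebook and assumes a single device, as the total train batch size of 32 in the log suggests:

```python
import math

num_examples = 162816  # training rows reported for the dataset
batch_size = 32        # per_device_train_batch_size, single device assumed
max_steps = 20000      # step budget given to TrainingArguments

steps_per_epoch = math.ceil(num_examples / batch_size)
epochs_seen = max_steps * batch_size / num_examples

print(steps_per_epoch)                         # 5088
print(round(epochs_seen, 2))                   # 3.93
print(math.ceil(max_steps / steps_per_epoch))  # 4
```

So the model sees each training example a little under four times.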