<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_seq2seq_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The task

* Lemmatization
* Input: wordform + morpho information
* Output: word baseform
* Easy for English, but not so much for Finnish or many other languages

Here are a few examples:

* dogs+NOUN+Plural -> dog
* sheep+NOUN+Plural -> sheep
* voi+VERB+... -> voida
* voi+NOUN+Singular -> voi
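
A toy illustration of why the morphological tags matter, using only the examples listed above: the same word form `voi` maps to different lemmas depending on its part of speech. The names `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and exist only for this sketch. A lookup table like this cannot handle unseen word forms, which is why a model is trained later on.

```python
# Toy lookup keyed by (wordform, part of speech); entries come from the examples above.
TOY_LEMMAS = {
    ("dogs", "NOUN"): "dog",
    ("sheep", "NOUN"): "sheep",
    ("voi", "VERB"): "voida",
    ("voi", "NOUN"): "voi",
}

def toy_lemmatize(form, upos):
    """Return the lemma for a known (form, upos) pair, or None if the pair is unseen."""
    return TOY_LEMMAS.get((form, upos))

print(toy_lemmatize("voi", "VERB"))   # the verb reading
print(toy_lemmatize("voi", "NOUN"))   # the noun reading
print(toy_lemmatize("cats", "NOUN"))  # not in the table, so the lookup fails
```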

# Data preparation

* We can use universaldependencies.org
* Collection of treebanks
* Pick your favorite language, I will use Finnish

In [21]:
!pip3 install --quiet datasets transformers

You can use e.g. UD_English-EWT for English or any other language you want from UniversalDependencies

In [22]:
!wget -q -O train.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-train.conllu
!wget -q -O validation.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-dev.conllu
!wget -q -O test.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-test.conllu

# Data preparation

* The CoNLL format should be familiar to you by now
* Here are a few lines (the delimiter is TAB)



```
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	4	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	4	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	1	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	6	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	4	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	4	punct	4:punct	_


```

* Let us form training examples like so:
    * Input is `wordform+++POS|FEATS`
    * Output is the lemma
* We can reuse part of our dataset preparation code from the [MLP notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_mlp.ipynb)


In [23]:
import json

In [24]:
ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10)

We now formulate the examples so that the input is the word form together with all of its morphological information, and the output is the lemma:

```
IN: Morphed+++VERB|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin

OUT: morph
```
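
As a minimal sketch of this mapping, the snippet below builds the pair from the `Morphed` line of the sample sentence. The helper name `make_pair` is hypothetical, and `range_line` is a made-up multiword-token line added only to show that lines whose ID is not a plain number get skipped.

```python
# Column indices of the ten tab-separated CoNLL-U fields
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

def make_pair(line):
    """Turn one CoNLL-U token line into (input, output); return None for non-word lines."""
    cols = line.rstrip("\n").split("\t")
    if not cols[0].isnumeric():  # e.g. multiword ranges such as "1-2"
        return None
    return cols[FORM] + "+++" + cols[UPOS] + "|" + cols[FEAT], cols[LEMMA]

line = "4\tMorphed\tmorph\tVERB\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t1\tadvcl\t1:advcl:if\t_"
range_line = "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_"  # hypothetical multiword-token line

pair = make_pair(line)
print(" IN:", pair[0])
print("OUT:", pair[1])
print("range line skipped:", make_pair(range_line) is None)
```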

In [25]:
def yield_examples(fname,uniq=True):
    """
    uniq: if True, yield each distinct (input, lemma) pair only once, so that frequent items such as punctuation are not duplicated
    """
    with open(fname,encoding="utf-8") as f:
        seen=set()
        for line in f:
            line=line.rstrip("\n")
            if not line or line.startswith("#"): #empty and comment lines: skip
                continue
            cols=line.split("\t")
            if not cols[0].isnumeric(): #lines which are not a real word: skip
                continue
            #form the example pair:
            #   IN: wordform+++POSTAG|all other tags
            #  OUT: lemma
            form_tags,lemma=cols[FORM]+"+++"+cols[UPOS]+"|"+cols[FEAT],cols[LEMMA]
            if uniq:
                if (form_tags,lemma) in seen:
                    continue
                seen.add((form_tags,lemma))
            #and here is the example
            yield {"form_tags":form_tags,"lemma":lemma}

* turn every `.conllu` into the corresponding `.jsonl` with the examples
* that way we can then easily load it as a dataset and train a model



In [26]:
for fname in ("train.conllu","validation.conllu","test.conllu"):
    with open(fname.replace(".conllu",".jsonl"),"wt",encoding="utf-8") as f_out:
        for example in yield_examples(fname):
            print(json.dumps(example,ensure_ascii=False,sort_keys=True),file=f_out)
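
To see what one line of the resulting `.jsonl` looks like, here is a small sketch that writes two examples to an in-memory buffer instead of a file and reads them back. The second pair is illustrative only (it is taken from the prediction printout later in the notebook, not from the gold data). With `ensure_ascii=False`, characters such as `ä` are stored as-is rather than as `\u` escapes.

```python
import io
import json

examples = [
    {"form_tags": "Morphed+++VERB|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin", "lemma": "morph"},
    {"form_tags": "musta+++PRON|Case=Ela|Number=Sing|PronType=Prs|Style=Coll", "lemma": "minä"},  # illustrative
]

buffer = io.StringIO()  # stands in for the .jsonl file
for example in examples:
    # one JSON object per line, same call as in the cell above
    print(json.dumps(example, ensure_ascii=False, sort_keys=True), file=buffer)

lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]  # parse every line back into a dict

for line in lines:
    print(line)
```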

## Load as dataset

* This is a slight modification of the loading code we've been using throughout the course


In [27]:
import datasets
dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"train":"train.jsonl","validation":"validation.jsonl","test":"test.jsonl"},
    split={
        "train":"train",
        "validation":"validation",
        "test":"test"
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "form_tags":datasets.Value("string"),
        "lemma":datasets.Value("string")
    })
)

Using custom data configuration default-9ddc4154c1e9b79f


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-9ddc4154c1e9b79f/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-9ddc4154c1e9b79f/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.



In [28]:
# this is always a good idea as we have learned!
dataset=dataset.shuffle()

In [29]:
dataset

DatasetDict({
    train: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 51100
    })
    validation: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 8662
    })
    test: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 9399
    })
})

# Tokenize and prepare

* This is a bit more complex than it might sound
* Let's stop to think; do we really want to tokenize this data in the usual manner?

The examples are formed much like what you have seen before:

* `input_ids` is the input side
* `attention_mask` is the input attention mask
* `labels` is the output ids
* the encoder-decoder model should (and hopefully does) take care of the rest
* it is a good idea to mark sequence start and end for the model both on the input and the output side
* we can tell the tokenizer to use `[unused1]` and `[unused2]` as the beginning/end of sequence tokens


In [30]:
import transformers

#OK, let's try with our trusty tokenizer
#but why would this work in the first place?
model_name = "TurkuNLP/bert-base-finnish-cased-v1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name,bos_token="[unused1]",eos_token="[unused2]")

def tokenize(example):
    # let's get the input word separated from the tags
    inp_w,inp_tags=example["form_tags"].split("+++",1)
    out=" ".join(example["lemma"])
    
    # make sure you separate everything by space, the tokenizer will pick it up
    # below I print one of the input/output pairs so check that out
    inp_tok=tokenizer("[unused1]"+" "+" ".join(inp_w)+" "+(inp_tags.replace("|"," "))+" "+"[unused2]",truncation=True)
    outp_tok=tokenizer("[unused1]"+" "+out+" "+"[unused2]",truncation=True)
    return {"input_ids":inp_tok["input_ids"],
            "attention_mask":inp_tok["attention_mask"],
            "labels":outp_tok["input_ids"]}

loading configuration file https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/e27939251243299384d3c49756d6710f25a683fa4d5e00e6f42fe6cc59202f07.1b2c5b5f39fed7ac39db55c0d2566730a96257ac7215ad6c2a8a109e2ccf1ccd
Model config BertConfig {
  "_name_or_path": "TurkuNLP/bert-base-finnish-cased-v1",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50105
}

loading file https://huggingface.co/TurkuNLP/bert-base-finnish-

In [31]:
dataset=dataset.map(tokenize)


In [32]:
print(" IN:",tokenizer.convert_ids_to_tokens(dataset["train"][0]["input_ids"]))
print("OUT:",tokenizer.convert_ids_to_tokens(dataset["train"][0]["labels"]))

 IN: ['[CLS]', '[unused1]', 'E', 'i', 'j', 'a', 'PR', '##O', '##P', '##N', 'Cas', '##e', '=', 'No', '##m', 'Nu', '##mb', '##er', '=', 'Sing', '[unused2]', '[SEP]']
OUT: ['[CLS]', '[unused1]', 'E', 'i', 'j', 'a', '[unused2]', '[SEP]']
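
The strings that reach the tokenizer can be reproduced without loading the tokenizer at all. The sketch below mirrors the string building in `tokenize`; the helper name `build_texts` is hypothetical, and the example pair is reconstructed from the tokens printed above.

```python
def build_texts(form_tags, lemma, bos="[unused1]", eos="[unused2]"):
    """Build the space-separated source and target strings fed to the tokenizer."""
    word, tags = form_tags.split("+++", 1)
    # characters of the word are separated by spaces, tags are split on "|"
    src = " ".join([bos, " ".join(word), tags.replace("|", " "), eos])
    # the lemma is also spelled out character by character
    tgt = " ".join([bos, " ".join(lemma), eos])
    return src, tgt

src, tgt = build_texts("Eija+++PROPN|Case=Nom|Number=Sing", "Eija")
print(" IN:", src)
print("OUT:", tgt)
```

The tokenizer then wraps each string in its own `[CLS]` and `[SEP]`, which is why both markers show up in the printout in addition to `[unused1]` and `[unused2]`.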


# Encoder - Decoder model

* We shall use a "vanilla" encoder-decoder model
* Luckily, it is still relatively easy
* Let us train a small model: 128-dimensional embeddings, 4 layers, 4 attention heads
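
A rough back-of-the-envelope check of these sizes, assuming the vocabulary size of 50105 shown in the `BertConfig` printout earlier (the model itself uses `tokenizer.vocab_size`). This is plain arithmetic and does not count all parameters of the model.

```python
vocab_size = 50105        # assumption: the vocab_size from the config printout
hidden_size = 128
num_attention_heads = 4

# the hidden size is split evenly among the attention heads
assert hidden_size % num_attention_heads == 0
head_size = hidden_size // num_attention_heads

# one token embedding table has vocab_size x hidden_size weights
embedding_weights = vocab_size * hidden_size

print("per-head size:", head_size)
print("weights in one token embedding table:", embedding_weights)
```

With a vocabulary this large and a hidden size this small, the embedding table accounts for a large share of the weights.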

In [45]:
config_encoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         )
config_decoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         decoder_start_token_id=53)
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
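config.decoder_start_token_id=53 #must be set, avoids an error; id 53 decodes to [unused52] in this vocabulary
config.pad_token_id=0            #must be set, avoids an error; 0 is the pad_token_id of the BERT config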
model = transformers.EncoderDecoderModel(config=config)

Set `config.is_decoder=True` and `config.add_cross_attention=True` for decoder_config


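It is useful to run `help(model.forward)` to check which arguments the model expects.

The library also ships a ready-made collator for seq2seq data, `DataCollatorForSeq2Seq`, which pads the inputs and the labels of each batch.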

In [34]:
#help(model.forward)

In [35]:
collator=transformers.DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                             model=model,
                                             padding=True,
                                             return_tensors="pt")
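
To make the padding concrete, here is a simplified pure-Python sketch of what such a collator does to a batch. It is not the library implementation: the function name `pad_batch` and the token ids are made up, and right-side padding is assumed. Inputs are padded with the pad token id and a zero attention mask. Labels are padded with `-100`, the collator's default `label_pad_token_id`, so that padded positions are ignored by the loss. The real collator also returns tensors and can prepare decoder inputs.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_out = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in = max_in - len(e["input_ids"])   # padding needed on the input side
        n_out = max_out - len(e["labels"])    # padding needed on the label side
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_in)
        batch["labels"].append(e["labels"] + [label_pad_id] * n_out)
    return batch

# two examples with made-up token ids
toy_examples = [
    {"input_ids": [5, 6, 7, 8], "attention_mask": [1, 1, 1, 1], "labels": [5, 9]},
    {"input_ids": [5, 6], "attention_mask": [1, 1], "labels": [5, 9, 10]},
]
toy_batch = pad_batch(toy_examples)
for key, rows in toy_batch.items():
    print(key, rows)
```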

In [46]:
trainer_args = transformers.Seq2SeqTrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=1e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    max_steps=15000,
    save_steps=1000,
    predict_with_generate=True #this did take a while to figure out !
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [37]:
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=3
)

In [47]:
trainer = transformers.Seq2SeqTrainer(
    model=model,
    args=trainer_args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer = tokenizer,
    callbacks=[early_stopping]
)

max_steps is given, it will override any value given in num_train_epochs


In [48]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: form_tags, lemma. If form_tags, lemma are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 51100
  Num Epochs = 19
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 15000


| Step | Training Loss | Validation Loss |
|---|---|---|
| 1000 | 0.7514 | 0.581467 |
| 2000 | 0.341 | 0.268069 |
| 3000 | 0.2371 | 0.172558 |
| 4000 | 0.1879 | 0.126903 |
| 5000 | 0.1547 | 0.10696 |
| 6000 | 0.1328 | 0.094583 |
| 7000 | 0.1238 | 0.086479 |
| 8000 | 0.1129 | 0.083417 |
| 9000 | 0.099 | 0.071957 |
| 10000 | 0.1 | 0.068422 |


The following columns in the evaluation set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: form_tags, lemma. If form_tags, lemma are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8662
  Batch size = 64
Saving model checkpoint to checkpoints/checkpoint-1000
Configuration saved in checkpoints/checkpoint-1000/config.json
Model weights saved in checkpoints/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in checkpoints/checkpoint-1000/tokenizer_config.json
Special tokens file saved in checkpoints/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: form_tags, lemma. If form_tags, lemma are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8662
 

TrainOutput(global_step=15000, training_loss=0.3399748749732971, metrics={'train_runtime': 3817.3914, 'train_samples_per_second': 251.481, 'train_steps_per_second': 3.929, 'total_flos': 2329351282685472.0, 'train_loss': 0.3399748749732971, 'epoch': 18.77})
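
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes that the last, partial batch of every epoch is kept, and takes all inputs from the training arguments and the log output above.

```python
import math

num_examples = 51100      # "Num examples" in the training log
batch_size = 64           # per_device_train_batch_size, single device
max_steps = 15000         # max_steps in the training arguments
train_runtime = 3817.3914 # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # partial last batch counts as a step
epochs = max_steps / steps_per_epoch
samples_per_second = max_steps * batch_size / train_runtime
steps_per_second = max_steps / train_runtime

print("steps per epoch:", steps_per_epoch)
print("epochs:", round(epochs, 2))
print("samples per second:", round(samples_per_second, 3))
print("steps per second:", round(steps_per_second, 3))
```

The epoch count agrees with the `'epoch': 18.77` entry in `TrainOutput`, and it explains why the log announces 19 epochs: the 15000 steps end partway through the 19th pass over the data.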

In [40]:
trainer.model.save_pretrained("s2s_lemmatizer")

Configuration saved in s2s_lemmatizer/config.json
Model weights saved in s2s_lemmatizer/pytorch_model.bin


In [50]:
dataset=dataset.shuffle()
test_data=dataset["test"].select(range(33))
predictions=trainer.predict(test_data)
for x,e in zip(predictions.predictions,test_data):
    print("------------------")
    print(">> ",e["form_tags"])
    print(tokenizer.decode(x))
    print()
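
The raw decoded strings from this loop contain the start tokens and whatever the model generates after the first end marker. A minimal sketch of a clean-up step follows. The function name `clean_prediction` is hypothetical, and the sample string is copied from the first prediction in the printout. Note that joining the characters this way would also remove any spaces inside a lemma.

```python
def clean_prediction(decoded, bos="[unused1]", eos="[unused2]"):
    """Keep only the characters between the first bos marker and the first eos marker after it."""
    tokens = decoded.split()
    if bos in tokens:
        tokens = tokens[tokens.index(bos) + 1:]  # drop everything up to and including bos
    if eos in tokens:
        tokens = tokens[:tokens.index(eos)]      # drop the first eos and everything after it
    return "".join(tokens)

raw = "[unused52] [CLS] [unused1] m i n ä [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] i n [unused2] [SEP] i"
print(clean_prediction(raw))
```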

The following columns in the test set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: form_tags, lemma. If form_tags, lemma are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 33
  Batch size = 64


------------------
>>  musta+++PRON|Case=Ela|Number=Sing|PronType=Prs|Style=Coll
[unused52] [CLS] [unused1] m i n ä [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] i n [unused2] [SEP] i

------------------
>>  hauska+++ADJ|Case=Nom|Degree=Pos|Number=Sing
[unused52] [CLS] [unused1] h a u s k a [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2]

------------------
>>  makean+++ADJ|Case=Gen|Degree=Pos|Number=Sing
[unused52] [CLS] [unused1] m a k e a [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]

------------------
>>  kokijaryhmiä+++NOUN|Case=Par|Number=Plur
[unused52] [CLS] [unused1] k o k i j a [UNK] r y h m ä [unused2] [SEP] [unused2] [SEP] [unused2]

------------------
>>  mm.+++ADV|Abbr=Yes
[unused52] [CLS] [unused1] m m. [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]

------------------
>>  km+++NOUN|Abbr=Yes|Case=P