<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_seq2seq_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The task

* Lemmatization
* Input: wordform + morpho information
* Output: word baseform
* Easy for English, but not so much for Finnish or many other languages

Here are a few examples:

* dogs+NOUN+Plural -> dog
* sheep+NOUN+Plural -> sheep
* voi+VERB+... -> voida
* voi+NOUN+Singular -> voi
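
To make the role of the morphological tags concrete, here is a minimal sketch. The toy `LEMMAS` table and the `candidate_lemmas` helper are purely illustrative (built from the examples above, not part of the course code). They show that the word form alone does not always determine the lemma:

```python
# Toy lookup table built from the examples above (illustrative only).
LEMMAS = {
    ("dogs", "NOUN"): "dog",
    ("sheep", "NOUN"): "sheep",
    ("voi", "VERB"): "voida",
    ("voi", "NOUN"): "voi",
}

def candidate_lemmas(wordform):
    """Return all lemmas recorded for a word form, ignoring its tags."""
    return sorted({lemma for (form, _), lemma in LEMMAS.items() if form == wordform})

# The form alone is ambiguous for "voi"...
print(candidate_lemmas("voi"))
# ...but together with the POS tag the lemma is determined.
print(LEMMAS[("voi", "VERB")])
```

This is why the model input will contain the tags and not just the word form.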

# Data preparation

* We can use universaldependencies.org
* Collection of treebanks
* Pick your favorite language, I will use Finnish

In [1]:
!pip3 install --quiet datasets transformers


You can use, e.g., UD_English-EWT for English, or the treebank of any other language available in Universal Dependencies.

In [2]:
!wget -O train.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-train.conllu
!wget -O validation.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-dev.conllu
!wget -O test.conllu https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-test.conllu


# Data preparation

* The CoNLL format should be familiar to you by now
* Here are a few lines (the delimiter is TAB)



```
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	4	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	4	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	1	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	6	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	4	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	4	punct	4:punct	_


```

* Let us form training examples like so:
    * Input is the word form combined with its `POS` and `FEATS`
    * Output is the lemma
* We can reuse part of our dataset preparation code from the [MLP notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_mlp.ipynb)


In [3]:
import json

In [4]:
ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10)

We now want to formulate the examples so that the input is the word plus all of its morphological information, and the output is the lemma:

```
IN: Morphed+++VERB|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin

OUT: morph
```
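
As a quick check of this format, the sketch below applies it to the `Morphed` line from the CoNLL-U sample above. `line_to_example` is a hypothetical helper used only for illustration; it assumes a well-formed 10-column line.

```python
# Column indices of the CoNLL-U format
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

def line_to_example(line):
    """Turn one TAB-separated CoNLL-U word line into an (input, output) pair."""
    cols = line.rstrip("\n").split("\t")
    form_tags = cols[FORM] + "+++" + cols[UPOS] + "|" + cols[FEAT]
    return form_tags, cols[LEMMA]

line = ("4\tMorphed\tmorph\tVERB\tVBD\t"
        "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t"
        "1\tadvcl\t1:advcl:if\t_")
form_tags, lemma = line_to_example(line)
print(" IN:", form_tags)
print("OUT:", lemma)
```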

In [5]:
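def yield_examples(fname,uniq=True):
    """
    Yield one {"form_tags", "lemma"} example per word line of a CoNLL-U file.
    uniq: if True, yield each distinct (form_tags, lemma) pair only once, so that
          frequent items such as punctuation do not produce duplicate examples
    """
    with open(fname,encoding="utf-8") as f: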
        seen=set()
        for line in f:
            line=line.rstrip("\n")
            if not line or line.startswith("#"): #empty and comment lines: skip
                continue
            cols=line.split("\t")
            if not cols[0].isnumeric(): #lines which are not a real word: skip
                continue
            #form the example pair:
            #   IN: wordform+++POSTAG|all other tags
            #  OUT: lemma
            form_tags,lemma=cols[FORM]+"+++"+cols[UPOS]+"|"+cols[FEAT],cols[LEMMA]
            if uniq:
                if (form_tags,lemma) in seen:
                    continue
                seen.add((form_tags,lemma))
            #and here is the example
            yield {"form_tags":form_tags,"lemma":lemma}
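
The `isnumeric()` test on the first column deserves a short illustration. In CoNLL-U, besides ordinary words with integer IDs, there can be multiword-token ranges (such as `2-3`) and empty nodes (such as `8.1`). The ID values below are made up for the example:

```python
# Made-up ID column values: ordinary words, a multiword-token range, an empty node
ids = ["1", "2-3", "2", "3", "8.1"]

# Same test as in yield_examples: only plain integer IDs count as real words
kept = [i for i in ids if i.isnumeric()]
print(kept)
```

Only the plain integer IDs survive, so range lines and empty nodes do not produce training examples.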

* turn every `.conllu` into the corresponding `.jsonl` with the examples
* that way we can then easily load it as a dataset and train a model



In [6]:
for fname in ("train.conllu","validation.conllu","test.conllu"):
    with open(fname.replace(".conllu",".jsonl"),"wt",encoding="utf-8") as f_out:
        for example in yield_examples(fname):
            print(json.dumps(example,ensure_ascii=False,sort_keys=True),file=f_out)
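
A small sketch of what ends up in the `.jsonl` files: one JSON object per line, with non-ASCII characters kept readable thanks to `ensure_ascii=False`. The two examples are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import io
import json

# Made-up examples in the same shape that yield_examples produces
examples = [
    {"form_tags": "voi+++NOUN|Case=Nom|Number=Sing", "lemma": "voi"},
    {"form_tags": "äidin+++NOUN|Case=Gen|Number=Sing", "lemma": "äiti"},
]

buffer = io.StringIO()  # stands in for the .jsonl file
for example in examples:
    print(json.dumps(example, ensure_ascii=False, sort_keys=True), file=buffer)

lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]  # reading the lines back
print(lines[1])
```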

## Load as dataset

* This is a slight modification of the loading code we've been using throughout the course


In [7]:
import datasets
dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"train":"train.jsonl","validation":"validation.jsonl","test":"test.jsonl"},
    split={
        "train":"train",
        "validation":"validation",
        "test":"test"
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "form_tags":datasets.Value("string"),
        "lemma":datasets.Value("string")
    })
)


In [8]:
# this is always a good idea!
dataset=dataset.shuffle()

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 51100
    })
    validation: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 8662
    })
    test: Dataset({
        features: ['form_tags', 'lemma'],
        num_rows: 9399
    })
})

# Tokenize and prepare

* This is a bit more complex than it might sound
* Let's stop to think: do we really want to tokenize this data in the usual manner?

In [12]:
import transformers

In [13]:
#OK, let's try with our trusty tokenizer
#but why would this work in the first place?
model_name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)


The examples are formed much like those you have seen before:

* `input_ids` is the input side
* `attention_mask` is the input attention mask
* `labels` is the output ids
* the encoder-decoder model should take care of the rest


In [14]:
tokenizer.add_special_tokens({ "additional_special_tokens": [ "[unused1]", "[unused2]" ] })

def tokenize(example):
    # let's get the input word separated from the tags
    inp_w,inp_tags=example["form_tags"].split("+++",1)
    out=" ".join(example["lemma"])
    
    # make sure you separate everything by space, the tokenizer will pick it up
    inp_tok=tokenizer("[unused1] "+" ".join(inp_w)+" "+(inp_tags.replace("|"," "))+" [unused2]",truncation=True)
    outp_tok=tokenizer("[unused1] "+out+" [unused2]",truncation=True)
    return {"input_ids":inp_tok["input_ids"],
            "attention_mask":inp_tok["attention_mask"],
            "labels":outp_tok["input_ids"]}

In [15]:
dataset=dataset.map(tokenize)
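
To see what the tokenizer receives, here is a sketch of only the string-building part of `tokenize`, without the tokenizer itself. `to_model_input` and `to_model_target` are hypothetical helper names introduced for this illustration.

```python
def to_model_input(form_tags):
    """Spell the word out character by character and append the tags."""
    word, tags = form_tags.split("+++", 1)
    return "[unused1] " + " ".join(word) + " " + tags.replace("|", " ") + " [unused2]"

def to_model_target(lemma):
    """Spell the lemma out character by character between the two markers."""
    return "[unused1] " + " ".join(lemma) + " [unused2]"

src = to_model_input("uskaltaisi+++VERB|Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin|Voice=Act")
tgt = to_model_target("uskaltaa")
print(" IN:", src)
print("OUT:", tgt)
```

Separating the characters by spaces means the model reads and writes the word one character at a time, while each tag stays a unit of its own before tokenization.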


In [19]:
print(" IN:",tokenizer.decode(dataset["train"][0]["input_ids"]))
print("OUT:",tokenizer.decode(dataset["train"][0]["labels"]))

 IN: [CLS] [unused1] u s k a l t a i s i VERB Mood = Cnd Number = Sing Person = 3 VerbForm = Fin Voice = Act [unused2] [SEP]
OUT: [CLS] [unused1] u s k a l t a a [unused2] [SEP]
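
Going the other way, a decoded output string has to be turned back into a plain lemma. The sketch below assumes the decoded text looks like the `OUT:` line above; `decoded_to_lemma` is a hypothetical helper, and it would also drop any spaces that belong to the lemma itself.

```python
def decoded_to_lemma(decoded, start="[unused1]", end="[unused2]"):
    """Keep the text between the two markers and glue the characters together."""
    if start in decoded:
        decoded = decoded.split(start, 1)[1]
    if end in decoded:
        decoded = decoded.split(end, 1)[0]
    return "".join(decoded.split())

print(decoded_to_lemma("[CLS] [unused1] u s k a l t a a [unused2] [SEP]"))
```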


# Encoder - Decoder model

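* We build a small BERT-style encoder and decoder from fresh configs, so the model weights start out random; only the tokenizer comes from the pretrained Finnish BERT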

In [None]:
config_encoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         )
config_decoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=128,
                                         num_hidden_layers=4,
                                         num_attention_heads=4,
                                         decoder_start_token_id=53)
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
config.decoder_start_token_id=53
config.pad_token_id=0
model = transformers.EncoderDecoderModel(config=config)

help(model.forward)
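
Why set `decoder_start_token_id` and `pad_token_id` in the config? To the best of our understanding, when only `labels` are given, the encoder-decoder model builds the decoder inputs by shifting the labels one step to the right. The pure-Python sketch below illustrates that idea. It is not the library's code, and the token ids are made up.

```python
def shift_right(labels, pad_token_id, decoder_start_token_id):
    """Prepend the start token, drop the last label, and replace ignored (-100) positions."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if t == -100 else t for t in shifted]

# Made-up label ids; -100 marks padded positions that the loss ignores
labels = [101, 7, 8, 102, -100]
decoder_input_ids = shift_right(labels, pad_token_id=0, decoder_start_token_id=53)
print(decoder_input_ids)
```

At each position the decoder sees the previous target tokens and is trained to predict the next one.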

In [None]:
collator=transformers.DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                             model=model,
                                             padding=True,
                                             return_tensors="pt")

In [None]:
lst=[]
for e in dataset["train"]:
    lst.append({"input_ids":e["input_ids"],"labels":e["labels"],"attention_mask":e["attention_mask"]})
    break
batch=collator(lst)
batch
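
With a single example there is nothing to pad, so the effect of the collator is not visible in the batch above. The sketch below mimics, in plain Python, what we expect the collator to do with examples of different lengths. It assumes right-side padding, pad id 0 for the inputs and -100 for the labels; `pad_batch` and the token ids are made up for illustration.

```python
def pad_batch(examples, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_out = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in = max_in - len(e["input_ids"])
        n_out = max_out - len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_in)
        # -100 tells the loss function to ignore these positions
        batch["labels"].append(e["labels"] + [label_pad_token_id] * n_out)
    return batch

examples = [
    {"input_ids": [2, 10, 11, 3], "attention_mask": [1, 1, 1, 1], "labels": [2, 20, 3]},
    {"input_ids": [2, 10, 3], "attention_mask": [1, 1, 1], "labels": [2, 20, 21, 22, 3]},
]
padded = pad_batch(examples)
print(padded)
```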

In [None]:
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    max_steps=30000,
    save_steps=1000
)

In [None]:
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)
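
The patience logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the callback's actual implementation. It assumes the monitored metric is the evaluation loss (lower is better), and the loss values are made up.

```python
def stopping_index(eval_losses, patience=5):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss        # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Made-up evaluation losses, one per evaluation round
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.73, 0.6]
print(stopping_index(losses, patience=5))
```

With `eval_steps=1000` and a patience of 5, this amounts to stopping after 5000 training steps without a new best evaluation result.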

In [None]:
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    callbacks=[early_stopping]
)

In [None]:
trainer.train()

In [None]:
trainer.model.save_pretrained("s2s_lemmatizer")