<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_seq2seq_dates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The task

* Map various date formats into their standard form with a seq2seq model

# Data preparation

* The data is here: https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/generated_dates.txt

In [1]:
!pip3 install --quiet datasets transformers


In [2]:
!wget -q -O data.tsv https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/generated_dates.txt

# Data preparation

* The format is simple: one example per line, with the input date and its normalized form separated by a TAB
* Here are a few lines



```
tammikuun 18. 1987	18.01.1987
joulukuun 26. 1993	26.12.1993
KESÄKUUN 16. 2009	16.06.2009
1997/8/7	07.08.1997
9. päivänä Heinäkuuta 1981	09.07.1981
1970.8.11	11.08.1970
1972.7.8	08.07.1972
1985/2/4	04.02.1985
1992/10/4	04.10.1992
HELMIKUUN 6. päivänä vuonna 2016	06.02.2016
Huhtikuun 28. 2014	28.04.2014
15.1.2003	15.01.2003
elokuun 25. päivänä 1998	25.08.1998
Tammikuun 19. päivänä vuonna 1977	19.01.1977
19.04.1995	19.04.1995
25.06.2010	25.06.2010
1998.02.21	21.02.1998
24.11.1977	24.11.1977
1986/4/25	25.04.1986
1998/04/03	03.04.1998
2007.8.15	15.08.2007
09/02/1982	09.02.1982
1977/05/31	31.05.1977
Toukokuun 13. 2000	13.05.2000
```
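
To make the format concrete, here is a minimal sketch that splits a few of the lines above on the first TAB and checks that the right-hand side parses as a date. It assumes the standard form is always `DD.MM.YYYY`, which is what the sample suggests; `parse_line` is a hypothetical helper, not part of the course code.

```python
from datetime import datetime

# Three lines copied from the sample above; "\t" is the TAB delimiter.
sample = (
    "tammikuun 18. 1987\t18.01.1987\n"
    "1997/8/7\t07.08.1997\n"
    "9. päivänä Heinäkuuta 1981\t09.07.1981"
)

def parse_line(line):
    """Split one line into (raw date, normalized date, parsed datetime)."""
    raw, norm = line.split("\t", 1)
    # Raises ValueError if the normalized side is not DD.MM.YYYY (assumed format).
    parsed = datetime.strptime(norm, "%d.%m.%Y")
    return raw, norm, parsed

rows = [parse_line(line) for line in sample.splitlines()]
for raw, norm, parsed in rows:
    print(f"{raw!r} -> {norm} ({parsed.date().isoformat()})")
```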

* We can reuse part of our dataset preparation code from the [MLP notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_mlp.ipynb)


In [6]:
import json

In [3]:
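def yield_examples(fname,uniq=True):
    """
    Yield one {"date":..., "datenorm":...} dict per line of a TAB-separated file.
    uniq: if True, skip (input, output) pairs that have already been seen
    """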
    with open(fname) as f:
        seen=set()
        for line in f:
            line=line.rstrip("\n")
            if not line: #empty lines: skip
                continue
            cols=line.split("\t",1)
            inp,outp=cols
            if uniq:
                if (inp,outp) in seen:
                    continue
                seen.add((inp,outp))
            #and here is the example
            yield {"date":inp,"datenorm":outp}

* Turn the `.tsv` into the corresponding `.jsonl`, with one example per line
* That way we can easily load it as a dataset and train a model



In [9]:
for fname in ["data.tsv"]:
    with open(fname.replace(".tsv",".jsonl"),"wt") as f_out:
        for example in yield_examples(fname):
            print(json.dumps(example,ensure_ascii=False,sort_keys=True),file=f_out)
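
As a small illustration of what one line of the resulting `.jsonl` file looks like, the sketch below serializes a single example taken from the sample data. It only uses the standard `json` module; the comparison with the default `ensure_ascii=True` is added here to show why `ensure_ascii=False` keeps the Finnish characters readable.

```python
import json

# One example from the sample data, keys deliberately out of order.
example = {"datenorm": "16.06.2009", "date": "KESÄKUUN 16. 2009"}

# Same options as in the conversion loop: readable non-ASCII, sorted keys.
line = json.dumps(example, ensure_ascii=False, sort_keys=True)

# Default behaviour for comparison: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(example, sort_keys=True)

# Both variants load back to the same dictionary.
roundtrip = json.loads(line)

print(line)
print(escaped)
```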

## Load as dataset

* This is a slight modification of the loading code we've been using throughout the course


In [11]:
import datasets
dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"everything":"data.jsonl"},
    split={ #this we saw in the MLP notebook:
        "train":"everything[:80%]",
        "validation":"everything[80%:90%]",
        "test":"everything[90%:]"
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "date":datasets.Value("string"),
        "datenorm":datasets.Value("string")
    })
)


In [12]:
# this is always a good idea as we have learned!
dataset=dataset.shuffle()

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['date', 'datenorm'],
        num_rows: 67648
    })
    validation: Dataset({
        features: ['date', 'datenorm'],
        num_rows: 8456
    })
    test: Dataset({
        features: ['date', 'datenorm'],
        num_rows: 8456
    })
})
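
The split sizes reported above are consistent with an 80/10/10 slicing of 84,560 examples. The sketch below imitates the percent slicing on a plain Python sequence; `percent_slice` is a hypothetical helper that mirrors the idea only, and the `datasets` library may round boundaries differently when the total is not evenly divisible.

```python
def percent_slice(items, start_pct, end_pct):
    """Return items[start:end], with the boundaries given as percentages of len(items)."""
    n = len(items)
    start = n * start_pct // 100
    end = n * end_pct // 100
    return items[start:end]

# Total number of rows, taken from the DatasetDict printed above.
everything = range(67648 + 8456 + 8456)

splits = {
    "train": percent_slice(everything, 0, 80),
    "validation": percent_slice(everything, 80, 90),
    "test": percent_slice(everything, 90, 100),
}
sizes = {name: len(part) for name, part in splits.items()}
print(sizes)
```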

# Tokenize and prepare



The examples are formed much like what you've seen before:

* `input_ids` is the input side
* `attention_mask` is the input attention mask
* `labels` is the output ids
* the encoder-decoder model should (and hopefully does) take care of the rest
* it is a good idea to mark sequence start and end for the model both on the input and the output side
* we can tell the tokenizer to use `[unused1]` and `[unused2]` as the beginning/end of sequence tokens


In [18]:
import transformers

#OK, let's try with our trusty tokenizer
model_name = "TurkuNLP/bert-base-finnish-cased-v1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name,bos_token="[unused1]",eos_token="[unused2]")

def tokenize(example):
    # wrap the input and the output in explicit start/end markers, then tokenize both
    
    inp_tok=tokenizer("[unused1] "+example["date"]+" [unused2]",truncation=True)
    outp_tok=tokenizer("[unused1] "+example["datenorm"]+" [unused2]",truncation=True)
    return {"input_ids":inp_tok["input_ids"],
            "attention_mask":inp_tok["attention_mask"],
            "labels":outp_tok["input_ids"]}

In [19]:
dataset=dataset.map(tokenize)


In [20]:
print(" IN:",tokenizer.convert_ids_to_tokens(dataset["train"][0]["input_ids"]))
print("OUT:",tokenizer.convert_ids_to_tokens(dataset["train"][0]["labels"]))

 IN: ['[CLS]', '[unused1]', '1983', '/', '03', '/', '22', '[unused2]', '[SEP]']
OUT: ['[CLS]', '[unused1]', '22', '.', '03', '.', '1983', '[unused2]', '[SEP]']


# Encoder - Decoder model

* We shall use a "vanilla" encoder-decoder model
* Luckily, it is still relatively easy
* Let us train a small model: 64-dimensional embeddings, 2 layers, and 4 attention heads in both the encoder and the decoder

In [21]:
config_encoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=64,
                                         num_hidden_layers=2,
                                         num_attention_heads=4,
                                         )
config_decoder = transformers.BertConfig(vocab_size=tokenizer.vocab_size,
                                         hidden_size=64,
                                         num_hidden_layers=2,
                                         num_attention_heads=4,
                                         decoder_start_token_id=53)
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
config.decoder_start_token_id=53 # first token fed to the decoder; leaving it unset causes an error
config.pad_token_id=0            # padding token id; leaving it unset causes an error
model = transformers.EncoderDecoderModel(config=config)

This is useful to run:

`help(model.forward)`

There is a ready-made Seq2Seq collator that pads the inputs and the labels of each batch:

In [None]:
#help(model.forward)

In [22]:
collator=transformers.DataCollatorForSeq2Seq(tokenizer=tokenizer,
                                             model=model,
                                             padding=True,
                                             return_tensors="pt")
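
As a rough picture of what the collator does with a batch: inputs are padded to the longest input, and labels to the longest label sequence. `DataCollatorForSeq2Seq` pads `labels` with `-100` by default, so that padded positions are ignored by the loss. The sketch below is a plain-Python imitation with made-up token ids, not the library's implementation; `pad_batch` is a hypothetical helper.

```python
def pad_batch(examples, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids/attention_mask and labels to the longest item in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_out = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in = max_in - len(e["input_ids"])
        n_out = max_out - len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_in)
        # Label padding uses -100 so that these positions do not contribute to the loss.
        batch["labels"].append(e["labels"] + [label_pad_token_id] * n_out)
    return batch

# Two toy examples; the token ids are invented for illustration.
toy = [
    {"input_ids": [101, 2, 7, 8, 3, 102], "attention_mask": [1] * 6, "labels": [101, 2, 9, 3, 102]},
    {"input_ids": [101, 2, 7, 3, 102], "attention_mask": [1] * 5, "labels": [101, 2, 9, 9, 9, 3, 102]},
]
batch = pad_batch(toy)
for key, rows in batch.items():
    print(key, rows)
```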

In [23]:
trainer_args = transformers.Seq2SeqTrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=1e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    max_steps=15000,
    save_steps=1000,
    predict_with_generate=True #this did take a while to figure out !
)
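
To relate `max_steps` and `eval_steps` to passes over the data, here is a quick arithmetic sketch. It assumes a single device and no gradient accumulation, so that one optimization step consumes exactly one batch of 64 examples; the training-set size is the `num_rows` value printed earlier.

```python
import math

num_train_examples = 67648   # num_rows of the train split
batch_size = 64              # per_device_train_batch_size
max_steps = 15000
eval_steps = 1000

# One epoch = one pass over the training data.
steps_per_epoch = math.ceil(num_train_examples / batch_size)

# How many passes the step budget allows, and how often we evaluate.
epochs_at_max_steps = max_steps / steps_per_epoch
epochs_per_eval = eval_steps / steps_per_epoch

print(f"steps per epoch:         {steps_per_epoch}")
print(f"epochs at max_steps:     {epochs_at_max_steps:.2f}")
print(f"epochs between evals:    {epochs_per_eval:.2f}")
```

Under these assumptions the step budget covers a bit more than 14 passes over the training data, and the model is evaluated slightly more often than once per epoch.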

In [27]:
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001
)

In [28]:
trainer = transformers.Seq2SeqTrainer(
    model=model,
    args=trainer_args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer = tokenizer,
    callbacks=[early_stopping]
)

max_steps is given, it will override any value given in num_train_epochs


In [29]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: datenorm, date. If datenorm, date are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67648
  Num Epochs = 15
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 15000


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 0.0003 | 0.000108 |
| 2000 | 0.0003 | 2.9e-05 |
| 3000 | 0.0001 | 1.4e-05 |
| 4000 | 0.0 | 1.2e-05 |



TrainOutput(global_step=4000, training_loss=0.00023687508929288014, metrics={'train_runtime': 322.4951, 'train_samples_per_second': 2976.79, 'train_steps_per_second': 46.512, 'total_flos': 36317307400704.0, 'train_loss': 0.00023687508929288014, 'epoch': 3.78})
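
Training ended at step 4000 rather than 15000, which is consistent with the early stopping settings: the validation loss kept decreasing, but never by more than `early_stopping_threshold=0.001` after the first evaluation. The sketch below is a simplified plain-Python imitation of patience-based early stopping, assuming the monitored metric is the validation loss and that an evaluation only counts as an improvement when it beats the best value so far by more than the threshold. It is not the library's code, and `first_stop_step` is a hypothetical helper.

```python
def first_stop_step(evals, patience, threshold):
    """Return the step at which training would stop, or None if it never stops."""
    best = None
    bad_evals = 0
    for step, loss in evals:
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            bad_evals = 0
        else:
            bad_evals += 1
        # Track the best loss seen so far, even if the gain was below the threshold.
        if best is None or loss < best:
            best = loss
        if bad_evals >= patience:
            return step
    return None

# Validation losses as logged in the table above.
logged = [(1000, 0.000108), (2000, 2.9e-05), (3000, 1.4e-05), (4000, 1.2e-05)]

stop_step = first_stop_step(logged, patience=3, threshold=0.001)
stop_step_no_threshold = first_stop_step(logged, patience=3, threshold=0.0)
print(stop_step, stop_step_no_threshold)
```

With a threshold of zero, the same four evaluations would not trigger a stop in this sketch, because every evaluation improves on the previous best.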

In [31]:
dataset=dataset.shuffle()
test_data=dataset["test"].select(range(33))
predictions=trainer.predict(test_data)
for x,e in zip(predictions.predictions,test_data):
    print("------------------")
    print(">> ",e["date"])
    print(tokenizer.decode(x))
    print()

The following columns in the test set  don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: datenorm, date. If datenorm, date are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 33
  Batch size = 64


------------------
>>  helmikuun 14. 2001
[unused52] [CLS] [unused1] 14. 02. 2001 [unused2] [SEP]. 2001 [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]. 2001

------------------
>>  30. päivänä TAMMIKUUTA 2004
[unused52] [CLS] [unused1] 30. 01. 2004 [unused2] [SEP]. 2004 [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]. 2004

------------------
>>  helmikuun 6. vuonna 1983
[unused52] [CLS] [unused1] 06. 02. 1983 [unused2] [SEP] 06. 1983 [unused2] [SEP] 02. 1983 [unused2] [SEP]

------------------
>>  joulukuun 21. vuonna 2012
[unused52] [CLS] [unused1] 21. 12. 2012 [unused2] [SEP] 21. 2012 [unused2] [SEP] [unused2] [SEP] [unused2] [SEP] 21

------------------
>>  23/10/1977
[unused52] [CLS] [unused1] 23. 10. 1977 [unused2] [SEP]. 1977 [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]. 1977

------------------
>>  1984/8/29
[unused52] [CLS] [unused1] 29. 08. 1984 [unused2] [SEP] 29. 1984 [unused2] [SEP] 29 [unused2] [SEP] 29.

------------------
>>  Kesäkuun 20. 2012
[unused52] [CLS] [unus
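
The decoded predictions contain the correct date between `[unused1]` and the first `[unused2]`, followed by extra tokens generated past the end marker. A minimal post-processing sketch is given below. It assumes the normalized date never contains spaces (true for `DD.MM.YYYY`), so the spaces the tokenizer inserts when decoding can simply be removed; `extract_date` is a hypothetical helper, not part of the notebook.

```python
def extract_date(decoded, start="[unused1]", end="[unused2]"):
    """Return the text between the first start marker and the following end marker."""
    _, found_start, rest = decoded.partition(start)
    if not found_start:
        return None  # no start marker: nothing to extract
    body, _, _ = rest.partition(end)
    # Assumption: the target format has no spaces, so drop the ones added by decoding.
    return body.replace(" ", "")

# Decoded outputs copied from the predictions above.
raw_outputs = [
    "[unused52] [CLS] [unused1] 14. 02. 2001 [unused2] [SEP]. 2001 [unused2] [SEP] [unused2] [SEP] [unused2] [SEP]. 2001",
    "[unused52] [CLS] [unused1] 06. 02. 1983 [unused2] [SEP] 06. 1983 [unused2] [SEP] 02. 1983 [unused2] [SEP]",
    "[unused52] [CLS] [unused1] 29. 08. 1984 [unused2] [SEP] 29. 1984 [unused2] [SEP] 29 [unused2] [SEP] 29.",
]
cleaned = [extract_date(text) for text in raw_outputs]
print(cleaned)
```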