# Case Study: Train a Deep Haiku Generator
--- 


In [1]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

Setting working dir to: /Users/ingomarquart/Documents/GitHub/itern-nlp-training-cases


## GPT-2 Fine-tuning for Haiku Generation

In this notebook, we will fine-tune a GPT-style causal language model for haiku generation.

In [2]:
import torch
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import AutoTokenizer

from datasets import load_dataset

### Exercise 1 - Data Loading

Use the `load_dataset` function from the `datasets` package to import the two files *haiku_data_1.txt* and *haiku_data_2.txt* from the *data* folder.
Check out the structure of the newly created object.   

The `Dataset` class comes with a `.train_test_split()` method for splitting data into train and test sets. Note that `load_dataset` returns a `DatasetDict`, so you need to select a split from it first.
Use the method to create two datasets, one for training and one for testing.



In [3]:
test_proportion = 0.1

# Add your solution here:
# ...


### Exercise 2 - Pre-Processing

Next, we need to pre-process our data, which here means tokenizing it.

- Load the correct pre-trained tokenizer for your model

As you have probably noticed, the raw text files already contain an EOS token in the form of `<|endoftext|>`.
- Check which EOS token the loaded model expects
- Change the EOS token if necessary

Now comes the tricky part: we need to tokenize and chunk the data for model training.
- Define two separate functions, one for tokenization and one for chunking  
- Then apply both to the dataset using the `.map()` method
- Define a data collator using the `DataCollatorForLanguageModeling` class   
- Test the whole pipeline by drawing a sample from the tokenized and chunked dataset and feeding it through the data collator for batching

In [None]:
model_name = "gpt2"
block_size = 128

# Add your solution here:
# ...

#### Hints

- For the tokenization function, you just need to tokenize the entire dataset
- For the chunking function, concatenate the individual inputs into one long list
- Afterwards, split the list into blocks of `block_size` tokens, as defined above
- It is fine to drop any remainder that does not fill the last context window
- For the label IDs, you can just copy the input IDs (the model shifts them internally when computing the loss)
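The chunking hints above can be sketched in plain Python. This is one possible shape under stated assumptions, not the reference solution: the function name `chunk_examples` and the toy batch are hypothetical, and the input is assumed to be a batch of tokenized examples given as a dict of lists of lists.

```python
from itertools import chain


def chunk_examples(examples, block_size):
    """Concatenate tokenized examples and split them into fixed-size blocks."""
    # Concatenate every column (e.g. input_ids, attention_mask) into one long list.
    concatenated = {
        key: list(chain.from_iterable(values)) for key, values in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a complete block.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # Labels are a plain copy of the input IDs; no manual shifting here.
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


# Toy batch (hypothetical token IDs) to show the behavior with a small block size.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_examples(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A function of this shape changes the number of rows, so when passing it to `.map()` it needs `batched=True`; the block size can be supplied through the `fn_kwargs` argument.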

### Exercise 3 - Model Training

After pre-processing our data, we can now fine-tune our pre-trained GPT-2 model. We will use the `AutoModelForCausalLM` and `Trainer` classes from the 🤗 Transformers package to do so.

- Load the pre-trained model and place it on the right device
- Define appropriate training arguments
- Define a (PyTorch) Trainer
- Start the training 🙃

In [None]:
!mkdir -p ../checkpoints
!mkdir -p ../logs

In [None]:
# Add your solution here:
# ...

### Exercise 4 - Generate from Model

Generate haiku from the fine-tuned model, starting from a given seed text. Also experiment with different decoding strategies.
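To make the idea of decoding strategies concrete, the following toy sketch contrasts greedy decoding with temperature-scaled top-k sampling for a single next-token step. It is a pure-Python illustration, not the model's actual generation code: the vocabulary, the logits, and the helper names are all hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: always pick the index with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Top-k sampling: keep the k best candidates, renormalize, then sample."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


# Hypothetical next-token scores over a tiny vocabulary.
toy_vocab = ["pond", "frog", "moon", "snow", "wind"]
toy_logits = [2.0, 1.5, 0.3, -1.0, -2.0]

rng = random.Random(0)
print(toy_vocab[greedy(toy_logits)])  # pond
samples = [toy_vocab[sample_top_k(toy_logits, k=2, rng=rng)] for _ in range(5)]
print(samples)  # only "pond" and "frog" can appear with k=2
```

Greedy decoding is deterministic and tends to repeat itself, while sampling trades some coherence for variety. In 🤗 Transformers, the `generate` method exposes comparable options such as `do_sample`, `top_k`, `top_p`, and `temperature`.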

In [None]:
# Add your solution here:
# ...