# Case Study: Train a Deep Haiku Generator
--- 


In [1]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

Setting working dir to: /Users/ingomarquart/Documents/GitHub/itern-nlp-training-cases


## GPT-2 Fine-tuning for Haiku Generation

In this notebook, we will fine-tune a GPT-style causal language model for haiku generation.

In [2]:
import torch
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import AutoTokenizer

from datasets import load_dataset

### Exercise 1 - Data Loading

Use the `load_dataset` function from the `datasets` package to import the two files *haiku_data_1.txt* and *haiku_data_2.txt* from the *data* folder.
Check out the structure of the newly created object.   

The `Dataset` and `DatasetDict` classes come with a `.train_test_split()` method for splitting data into train and test sets.
Use it to create two datasets, one for train, the other for test.



In [3]:
test_proportion = 0.1

# Add your solution here:
# ...


#### Solution

In [5]:
# Load both files with the generic "text" loader (one example per line)
files = ['data/haiku_data_1.txt', 'data/haiku_data_2.txt']
dataset_raw = load_dataset('text', data_files=files)
# Drop empty lines
dataset_raw = dataset_raw.filter(lambda x: x['text'] != '')

print('After loading the dataset, it looks like:')
print(dataset_raw)

# Apply the train_test_split method to create a train and a test split
dataset_raw = dataset_raw['train'].train_test_split(test_size=test_proportion)

print('After splitting the dataset, it  looks like:')
print(dataset_raw)

# Slice a few samples and take a look at them
print('The training set looks like:')
print(dataset_raw['train'][0:10])

After loading the dataset, it looks like:
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 89119
    })
})
After splitting the dataset, it  looks like:
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 80207
    })
    test: Dataset({
        features: ['text'],
        num_rows: 8912
    })
})
The training set looks like:
{'text': ['none of the students', 'how to tell', '<|endoftext|>', 'gone to seed', 'the sea', 'spring morning', 'the cactus flower', 'memory betrays', '<|endoftext|>', 'on a foggy window']}
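
The row counts printed above are consistent with the test share being rounded up to the next whole row, with the remainder going to the training set. A quick sanity check in plain Python; the rounding rule is our assumption about how `train_test_split` resolves a fractional test size, chosen because it reproduces the numbers shown:

```python
import math

n_rows = 89119          # rows reported after loading and filtering
test_proportion = 0.1   # same value as in the exercise cell

# Assumption: the fractional test size is rounded up to a whole number of rows
n_test = math.ceil(test_proportion * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```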


### Exercise 2 - Pre-Processing

Next, we need to pre-process our data, namely tokenize it.

- Load the correct pre-trained tokenizer for your model

As you have probably noticed, the raw text files already contain an EOS token in the form of "<|endoftext|>".   
- Check what kind of EOS token the loaded model expects
- Change the EOS token if necessary

Now comes the tricky part: we need to tokenize and chunk the data for model training.   
- Define two separate functions, one for tokenization and one for chunking  
- Then apply both to the dataset using the `.map()` method
- Define a data collator using the `DataCollatorForLanguageModeling` class   
- Test the whole pipeline by drawing a sample from the tokenized and chunked dataset and feeding it through the data collator for batching

In [None]:
model_name = "gpt2"
block_size = 128

# Add your solution here:
# ...

#### Hints

- For the tokenizer function, you just need to tokenize the entire dataset
- For the chunking function, concatenate the various inputs into one long list
- Afterwards, split the list using the block size defined above
- It should be fine to drop any remainder that does not fit into the last context window
- For the label IDs, you can just copy the input IDs (the model shifts them internally when computing the loss)
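
The concatenate-then-split idea from the hints can be tried on toy data before touching the real dataset. The sketch below is plain Python with made-up token IDs; `chunk_examples` and the block size of 4 are illustrative names and values, not part of the notebook:

```python
from itertools import chain

def chunk_examples(examples, block_size):
    """Concatenate each column, then cut it into blocks of block_size tokens."""
    # Flatten every column (e.g. input_ids, attention_mask) into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    # Round the total length down to a multiple of block_size, dropping the remainder
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels start out as a copy of the inputs
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_examples(toy, block_size=4)
print(chunks["input_ids"])  # two blocks of four tokens; tokens 9 and 10 are dropped
```

Because the function returns a different number of rows than it receives, it only makes sense with `.map(..., batched=True)`.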

#### Solution

In [None]:
# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print('Context length:', tokenizer.model_max_length)

# Check EOS token
print(f'Model special tokens:\n {tokenizer.special_tokens_map}')

# We will use two separate functions for tokenization and chunking

# Function to tokenize the text
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True)

# Function to chunk the text
def chunk_function(examples):
    
    # Concatenate all texts via sum of lists
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # We drop the small remainder, we could add padding if the model supported 
    # it instead of this drop, you can customize this part to your needs
    total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size tokens
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    
    # DataCollatorForLanguageModeling will handle the labels later, therefore 
    # we just copy the inputs to the labels
    result['labels'] = result['input_ids'].copy()
    
    return result

# Apply tokenize and chunking to texts in the dataset
# batched=True for faster computation 
# remove_columns because we don't need the raw text for training anymore
dataset_tokenized = dataset_raw.map(tokenize_function, 
                                    batched=True, 
                                    remove_columns=['text'])

dataset_lm = dataset_tokenized.map(chunk_function,
                                   batched=True)

# GPT-2 has no padding token by default; reusing the EOS token is a common choice.
# Set it before the collator is used, since the collator pads with this tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Create the DataCollator for causal language modeling (mlm=False)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Test what we have created
# First we draw a single sample from the created dataset
sample = dataset_lm['train'][0]
print(f'First 5 elements of a sample:\n {sample["input_ids"][0:5]}')

# Then we batch the example (data_collator expects a list of dicts)
batch = data_collator([sample])

# And take a look at what comes out
print(f'Batch has {batch.keys()}')
print(f'input_ids = {batch["input_ids"].shape}')
print(f'attention_mask = {batch["attention_mask"].shape}')
print(f'labels = {batch["labels"].shape}')

print(batch['input_ids'][0, :5].detach().cpu().numpy())
print(batch['labels'][0, :5].detach().cpu().numpy())
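
The labels printed above are not shifted relative to the inputs. The causal LM shifts them by one position internally when it computes the loss, so each token is trained to predict its successor. A minimal plain-Python sketch of that alignment, using made-up token IDs; `next_token_pairs` is an illustrative helper, and -100 is the label value that PyTorch's cross-entropy loss ignores by default:

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def next_token_pairs(input_ids, labels):
    """Pair each position with the label one step ahead, skipping ignored labels."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[position], target))
    return pairs

input_ids = [10, 11, 12, 50256, 13]      # made-up IDs; 50256 is GPT-2's EOS ID
labels = [10, 11, 12, IGNORE_INDEX, 13]  # copy of the inputs with one masked position

print(next_token_pairs(input_ids, labels))
```

One thing worth checking in your own run: because the pad token is set to the EOS token, the collator may mask EOS positions in the labels, in which case the model gets no direct training signal for ending a haiku.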

### Exercise 3 - Model Training

After pre-processing our data, we can now fine-tune our pre-trained GPT-2 model. We will use the `AutoModelForCausalLM` and `Trainer` classes from the 🤗 Transformers package to do so.  

- Load the pre-trained model and place it on the right device
- Define appropriate training arguments
- Define a (PyTorch) Trainer
- Start the training 🙃

In [None]:
!mkdir -p ../checkpoints
!mkdir -p ../logs

In [None]:
# Add your solution here:
# ...

#### Solution

In [None]:
# Folders for checkpoints / logs
output_dir = f'../checkpoints/{model_name}'
log_dir = f'../logs/{model_name}'
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

# Load model and place on appropriate device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Define the training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    logging_dir=log_dir,
    num_train_epochs=3,
    # max_steps=150000,  # alternative to num_train_epochs; if used,
    # also set evaluation_strategy='steps' and eval_steps=1000
    warmup_steps=1000,
    # Note: newer transformers releases may expect eval_strategy instead
    evaluation_strategy='epoch',
    learning_rate=5e-4,
    weight_decay=0.1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
)

# Define trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_lm['train'],
    eval_dataset=dataset_lm['test'],
    data_collator=data_collator,
)

# Start the training and save model after training
trainer.train()
trainer.save_model()
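
To get a feeling for what `warmup_steps=1000` means, the sketch below reproduces a linear warmup followed by a linear decay in plain Python. We assume the `Trainer` default scheduler is of this linear type; `linear_schedule_lr` is an illustrative helper and `total_steps=3000` is a made-up value, since the real number depends on how many chunks your dataset yields:

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 5e-4        # learning_rate from the training arguments
warmup_steps = 1000   # warmup_steps from the training arguments
total_steps = 3000    # hypothetical total number of optimizer steps

for step in (0, 500, 1000, 2000, 3000):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

If the total number of optimizer steps turns out to be smaller than `warmup_steps`, training ends while the rate is still ramping up, so it is worth comparing the two numbers for your dataset.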

### Exercise 4 - Generate from Model

Try to see how the model generates haiku from the given seed text. Make sure to also play around with the decoding strategies.

In [None]:
# Add your solution here:
# ...

#### Solution

In [None]:
# Prompt for model input
new_input_txt = 'Deep learning'

# Create tokens
input_ids = tokenizer.encode(new_input_txt, return_tensors='pt').to(device)

# Generate a continuation (evaluation mode, no gradients needed)
model.eval()
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=25,
                                pad_token_id=tokenizer.eos_token_id)

# Decode the output
tokenizer.decode(output_ids[0].detach().cpu().numpy())
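
With only `max_length` set, `generate` decodes greedily unless the model's generation config says otherwise. To build intuition for the alternatives, here is a toy, library-free sketch of greedy selection versus temperature and top-k sampling over a made-up next-token distribution. The vocabulary, logits and helper names are invented; in the real call the corresponding `generate` arguments are `do_sample`, `temperature` and `top_k`:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf so they get zero probability."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

vocab = ["snow", "moon", "frog", "rain", "gpu"]  # invented toy vocabulary
logits = [2.0, 1.5, 0.3, 1.0, -1.0]               # invented next-token scores

# Greedy decoding: always take the most likely token
greedy_token = vocab[logits.index(max(logits))]

# Top-k sampling with temperature: sample only among the k best candidates
rng = random.Random(0)
probs = softmax(top_k_filter(logits, k=3), temperature=0.7)
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

print(greedy_token, sampled_token)
```

Greedy decoding returns the same continuation for a given prompt every time, while sampling trades some coherence for variety, which is often what you want for poetry.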