## Imports

In [22]:
from transformers import (AutoTokenizer, AutoModelForCausalLM, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from datasets import load_dataset

In [42]:
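def sample_data(dataset, n=5):
    """Print the first n records of a dataset (fewer if the dataset is smaller)."""
    for i in range(min(n, len(dataset))):
        print(dataset[i])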

## Load the Dataset

In [35]:
ds = load_dataset("shuttie/dadjokes")
train_data = ds["train"]
test_data = ds["test"]

## Preprocess the data
<p>GPT-like models are trained with causal language modeling, which expects each example as a single piece of text. Our dataset stores each question separately from its response, so we need to combine the two fields as follows:</p>

In [36]:
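def combine_fields(record):
    """Merge the question and response fields into a single "text" field."""
    q = record.get("question")
    r = record.get("response")
    # Records with a missing field yield empty text instead of raising an error.
    if q is None or r is None:
        return {"text": ""}
    return {"text": q.strip() + " " + r.strip()}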

In [None]:
train_data = train_data.map(combine_fields)
test_data = test_data.map(combine_fields)

train_data = train_data.remove_columns(["question","response"])
test_data = test_data.remove_columns(["question","response"])

sample_data(train_data)

{'text': 'I asked my priest how he gets holy water He said it’s just regular water, he just boils the hell out of it'}
{'text': 'Life Hack: If you play My Chemical Romance loud enough in your yard your grass will cut itself'}
{'text': 'OMG. SISTERS. JAMES. CHARLES. IS. DOING. A GIVEAWAY his career'}
{'text': 'Why did Mr.  Potato Head get pulled over He was baked'}
{'text': 'On zombie cravings.  My kids and i had some fun with these on a car trip this past weekend.   What do zombie plumbers crave.  Draaaaains.   What do zombie pilots crave.  Planes.  Plaaaanes.   What do zombie conductors crave.  Traaaains.   What do zombie opthalmologists crave.  Fraaames.   What do zombie construction workers crave.  Craaanes.   What do zombie nurses crave.  Paaains.   What do vampires crave Blood'}
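
To make the merging step concrete, here is a minimal pure-Python sketch that applies the same logic as `combine_fields` to two hand-written records and then drops the empty result. The sample records and the `keep_non_empty` helper are hypothetical and not part of the notebook; on a Hugging Face dataset, the same cleanup could presumably be done with `Dataset.filter`.

```python
def combine_fields(record):
    # Same logic as the notebook cell above: join question and response.
    q = record.get("question")
    r = record.get("response")
    if q is None or r is None:
        return {"text": ""}
    return {"text": q.strip() + " " + r.strip()}


def keep_non_empty(records):
    # Hypothetical helper: drop records whose combined text is empty.
    return [rec for rec in records if rec["text"]]


# Hand-written sample records (illustrative only, not taken from the dataset).
samples = [
    {"question": "Why did the chicken cross the road? ", "response": " To get to the other side"},
    {"question": "A question with no response", "response": None},
]

combined = [combine_fields(rec) for rec in samples]
cleaned = keep_non_empty(combined)
print(cleaned)
```

Filtering matters because an empty `text` still becomes a training example made only of padding, which adds nothing useful to fine-tuning.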


## Tokenization
<p>Tokenization is the process of breaking text into small units called <strong>tokens</strong>, which are then mapped to integer IDs so that the model can process them.</p>

In [38]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
# GPT-2 tokenizers have no padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token



### Example of a sentence after tokenization

In [39]:
tokenizer("Why did the chicken cross the road?", padding="max_length", max_length=10)

{'input_ids': [5195, 750, 262, 9015, 3272, 262, 2975, 30, 50256, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]}
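
The output above shows what padding to a fixed length does: the 8 real tokens are followed by 2 copies of the pad token (ID 50256, the EOS token), and the attention mask marks those two positions with 0. The sketch below reproduces that result with plain lists. `pad_and_mask` is a hypothetical helper written for illustration, not a tokenizer API, and it also includes the truncation that `truncation=True` would apply.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    # Hypothetical helper mimicking right-side padding to a fixed length,
    # plus truncation of sequences that are longer than max_length.
    ids = token_ids[:max_length]        # truncate if too long
    n_pad = max_length - len(ids)       # number of pad tokens to append
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# Token IDs of "Why did the chicken cross the road?" copied from the output above.
ids = [5195, 750, 262, 9015, 3272, 262, 2975, 30]
encoded = pad_and_mask(ids, max_length=10, pad_id=50256)
print(encoded)
```

A fixed length lets every example in a batch share the same shape, while the mask tells the model which positions to ignore.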

In [44]:
def tokenize_function(record):
    return tokenizer(record["text"], truncation=True, padding="max_length", max_length=64)

tokenized_data = train_data.map(tokenize_function, batched=True)

sample_data(tokenized_data)

Map: 100%|██████████| 52000/52000 [00:01<00:00, 33896.20 examples/s]

{'text': 'I asked my priest how he gets holy water He said it’s just regular water, he just boils the hell out of it', 'input_ids': [40, 1965, 616, 11503, 703, 339, 3011, 11386, 1660, 679, 531, 340, 447, 247, 82, 655, 3218, 1660, 11, 339, 655, 40169, 262, 5968, 503, 286, 340, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
{'text': 'Life Hack: If you play My Chemical Romance loud enough in your yard your grass will cut itself', 'input_ids': [14662, 18281, 25, 1002, 345, 711, 2011, 24872, 36555, 7812, 1576, 287, 534, 12699, 534, 8701, 481, 2005, 2346, 50256, 50256, 50256




## Training the Model

In [None]:
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
training_arguments = TrainingArguments(
    output_dir='./joke-model',
    num_train_epochs=5,
    # per_device_eval_batch_size=8,  # uncomment if evaluation is added later or a validation split is created
    logging_steps=100,
    save_steps=1000,
    save_total_limit=1,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_data,
    data_collator=data_collator
)
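
For a rough sense of how long this configuration trains, the number of optimizer steps can be estimated from the dataset size. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default batch size of 8 per device, none of which is set explicitly in the cell above; `steps_per_epoch` is a hypothetical helper. A different batch size or device count changes these numbers.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    # Each step consumes one batch; the last batch may be partial.
    return math.ceil(num_examples / batch_size)


train_examples = 52_000   # size of the training split (see the Map progress bar)
batch_size = 8            # assumed: the Trainer default per_device_train_batch_size
epochs = 5                # num_train_epochs from the cell above

per_epoch = steps_per_epoch(train_examples, batch_size)
total_steps = per_epoch * epochs
print(per_epoch, total_steps)
```

The same arithmetic applied to the 1,400-example test split at batch size 8 gives 175 batches, which matches the evaluation progress bar later in the notebook.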



In [8]:
# trainer.train()

## Trying the model (Before vs. After)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token 

prompt = "Why did the chicken cross the road?"

original_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
inputs = tokenizer(prompt, return_tensors="pt")
original_output = original_model.generate(**inputs, max_length=64)
original_text = tokenizer.decode(original_output[0], skip_special_tokens=True)


fine_tuned_model = AutoModelForCausalLM.from_pretrained("./joke-model/checkpoint-65000")
fine_tuned_output = fine_tuned_model.generate(**inputs, max_length=64)
fine_tuned_text = tokenizer.decode(fine_tuned_output[0], skip_special_tokens=True)

print(" Before Fine-tuning:\n", original_text)
print("\n After Fine-tuning:\n", fine_tuned_text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Before Fine-tuning:
 Why did the chicken cross the road?”









































 After Fine-tuning:
 Why did the chicken cross the road?  To get to the other side of the road.                               


### Explanation of the output above:
<ul>
<i>Before fine-tuning:<br>
The original pre-trained model (distilgpt2) has not been tuned to generate jokes, so it does not continue the prompt with a punchline. The output is almost entirely blank. Since <code>skip_special_tokens=True</code> strips EOS tokens during decoding, the blank lines are most likely newline tokens that the model kept repeating, rather than EOS padding.</i>
<br>
<br>
<i>After fine-tuning:<br>
After training on the jokes dataset, the model continues the prompt with a punchline, which suggests that fine-tuning taught it the question-and-answer format of the jokes.</i>
</ul>
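
`generate` is called above without any sampling options, so decoding is presumably greedy: at each step the single most likely next token is picked. The toy sketch below uses a made-up next-token table rather than a real model, and shows how greedy decoding can lock into repeating one token, which is a plausible cause of a long run of blank lines.

```python
# Made-up next-token probabilities (illustrative only, not from distilgpt2).
NEXT_TOKEN_PROBS = {
    "?": {"\n": 0.6, " To": 0.4},
    "\n": {"\n": 0.7, "Why": 0.3},
    " To": {" get": 1.0},
    " get": {"\n": 1.0},
    "Why": {"?": 1.0},
}


def greedy_decode(start, steps):
    # Always pick the most probable next token (argmax), as greedy decoding does.
    tokens = [start]
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


decoded = greedy_decode("?", steps=5)
print(decoded)
```

With a real model, `generate` options such as `do_sample=True`, `top_k`, `top_p`, or `no_repeat_ngram_size` may reduce this kind of repetition, though the effect on this particular prompt was not tested here.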

## Evaluation
<p>We evaluate the model on the test set (unseen data), but first we need to tokenize it.</p>

In [61]:
tokenized_test_data = test_data.map(tokenize_function, batched=True)

Map: 100%|██████████| 1400/1400 [00:00<00:00, 24373.44 examples/s]


In [None]:
model = AutoModelForCausalLM.from_pretrained("./joke-model/checkpoint-65000/")

training_arguments = TrainingArguments(
    output_dir='./joke-model',
    per_device_eval_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_arguments,
    eval_dataset=tokenized_test_data,
    data_collator=data_collator
)

eval_results = trainer.evaluate()
print(eval_results)

100%|██████████| 175/175 [00:02<00:00, 76.17it/s]

{'eval_loss': 3.435093402862549, 'eval_runtime': 2.3533, 'eval_samples_per_second': 594.913, 'eval_steps_per_second': 74.364}
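
A loss value is easier to interpret as perplexity. Assuming `eval_loss` is the mean per-token cross-entropy in nats (which, as far as I know, is what the causal language modeling loss reports), perplexity is simply its exponential. The sketch below converts the value from the output above.

```python
import math

eval_loss = 3.435093402862549   # eval_loss copied from the output above
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```

A perplexity of about 31 means that, on average, the model is roughly as uncertain as if it were choosing uniformly among 31 candidate tokens at each position.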





<p>These results are plausible. A loss of 3.435 may look high, but it is reasonable given that we fine-tuned the model on a small dataset. A training set of 52K jokes is very small by NLP standards, as the following estimate shows:</p>
<p>Assuming ~20 tokens per joke → ~1 million tokens for the whole dataset.
That’s ~0.01% of what GPT-2 was trained on, which is <a href="https://lambda.ai/blog/demystifying-gpt-3">~40GB of Internet text (8 million documents), or ~10 billion tokens</a> 😅.</p>
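
The estimate can be checked with a few lines of arithmetic. The 20 tokens per joke is the rough assumption from the text, and the 10 billion token figure is the cited estimate for GPT-2's training data, so the result is an order-of-magnitude figure only.

```python
num_jokes = 52_000              # training records
tokens_per_joke = 20            # rough assumption from the text above
gpt2_tokens = 10_000_000_000    # ~10 billion tokens, the estimate cited above

dataset_tokens = num_jokes * tokens_per_joke
share_percent = dataset_tokens / gpt2_tokens * 100
print(dataset_tokens, f"{share_percent:.4f}%")
```

Under these assumptions, the jokes dataset holds about 1.04 million tokens, or roughly 0.01% of the pre-training data.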

## GPU detection

In [67]:
import torch
print(torch.cuda.is_available())  # True means at least one CUDA GPU is available (TPUs are not detected here).
print(torch.cuda.device_count())  # Number of CUDA GPUs visible to PyTorch.

True
1
