<a href="https://colab.research.google.com/github/RamonSaturninoM/GPT-Translation_Modeling/blob/main/language_model_jokes_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
    pipeline,
)
import torch
import os

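# Disable Weights & Biases logging. This environment variable is deprecated;
# passing report_to="none" to TrainingArguments is the supported replacement.
os.environ["WANDB_DISABLED"] = "true"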

# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

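# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# No new token was added, so this resize leaves the embedding matrix unchanged
model.resize_token_embeddings(len(tokenizer))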

# Build the training dataset from jokes.txt, split into blocks of block_size tokens
def load_dataset(file_path, tokenizer, block_size=128):
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
    )

dataset = load_dataset("jokes.txt", tokenizer)

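# Data collator for causal language modeling: with mlm=False no tokens are masked,
# and the labels are copies of the input IDs (padding positions are ignored in the loss)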
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./joke_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=1,
    logging_steps=100,
    report_to="none",  # disable logging integrations such as W&B; evaluation is off by default
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save model
trainer.save_model("./joke_model")
tokenizer.save_pretrained("./joke_model")

# Reload the fine-tuned tokenizer and model from disk for generation
tokenizer = GPT2Tokenizer.from_pretrained("./joke_model")
model = GPT2LMHeadModel.from_pretrained("./joke_model")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompts = [
    "Why did the chicken",
    "I told my friend",
    "My dog",
    "What's the deal with",
    "When I was young,"
]

print("Generated Jokes:\n")
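for prompt in prompts:
    # max_length counts the prompt tokens plus the generated tokens
    result = generator(prompt, max_length=40, num_return_sequences=1)
    # generated_text already starts with the prompt, so the prompt appears twice below
    print(f"{prompt} → {result[0]['generated_text']}\n")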
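
The `block_size=128` argument passed to `TextDataset` controls how the tokenized file is cut into training examples. The sketch below illustrates the idea in plain Python, assuming the behavior is: tokenize the whole file into one long ID sequence, slice it into consecutive non-overlapping blocks, and drop a trailing partial block. The function name `chunk_token_ids` is hypothetical and is not part of `transformers`.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int = 128) -> List[List[int]]:
    """Split a flat token-ID sequence into full, non-overlapping blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    # Stop early enough that every slice has exactly block_size items,
    # so a shorter trailing remainder is discarded.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Toy example: 10 token IDs with a block size of 4 (assumed values for illustration).
example_ids = list(range(10))
example_blocks = chunk_token_ids(example_ids, block_size=4)
print(example_blocks)
```

Under this assumption, block boundaries ignore joke boundaries, so a training example can start or end in the middle of a joke. That may help explain why generated samples run from one joke straight into the next.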



Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,3.9768
200,3.8715
300,3.8317
400,3.7895
500,3.7892
600,3.7791
700,3.7478
800,3.7278
900,3.7045
1000,3.7238
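
The training loss logged for a causal language model is, approximately, the mean cross-entropy per token in nats, so exponentiating it gives a perplexity that can be easier to interpret. A minimal sketch using the values logged above; the variable names are illustrative only.

```python
import math

# Training loss logged every 100 steps (copied from the table above).
training_loss = {
    100: 3.9768,
    200: 3.8715,
    300: 3.8317,
    400: 3.7895,
    500: 3.7892,
    600: 3.7791,
    700: 3.7478,
    800: 3.7278,
    900: 3.7045,
    1000: 3.7238,
}

# Perplexity is the exponential of the per-token cross-entropy.
perplexity = {step: math.exp(loss) for step, loss in training_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>4}: loss {training_loss[step]:.4f} -> perplexity {ppl:.1f}")
```

By this measure the model moves from a perplexity of roughly 53 at step 100 to roughly 41 at step 1000. This is training loss only; the run has no evaluation set, so it says nothing about held-out performance.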


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated Jokes:

Why did the chicken → Why did the chicken cross the basketball court? He played with jizz
What's black and white and pink and red all over? The prison escapee.
Did you hear about the guy who

I told my friend → I told my friend who's a porn star She thought I was an expert and warned her before she got in.
I went in to see the doctor once And she said: "Doc, I

My dog → My dog loves sports. So does he.
How did the egg say when it got hit by a car? How did the egg say when its hit by a car? "I got hit by

What's the deal with → What's the deal with fat chicks? They can get laid for anything.
Which one do you hear the most about? One who can fly and another who is a dog.
When I'm

When I was young, → When I was young, I wanted to be a priest The priest wanted something nice to say to me
Why was 6 afraid of 7? Because 7 is a registered six offender! -Adam Scott,
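
Each sample above continues past its first joke into further ones until `max_length` cuts it off. One possible post-processing step, sketched below, keeps only the first line of a sample. It assumes `jokes.txt` stores one joke per line, which is inferred from the output rather than confirmed, and the helper name `first_joke` is hypothetical.

```python
def first_joke(generated_text: str) -> str:
    """Return the first non-empty line of a generated sample."""
    for line in generated_text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


# A sample shaped like the pipeline output above: one joke, then a truncated second one.
sample = "My dog loves sports. So does he.\nHow did the egg say when it got hit by"
print(first_joke(sample))
```

In the notebook this would be applied to `result[0]['generated_text']` before printing.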

