<a href="https://colab.research.google.com/github/AzizMosbah/TextGeneration/blob/main/TextGeneration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a light version of GPT-2 for Language Generation

In this notebook we fine-tune DistilGPT2, a distilled (lighter) version of the GPT-2 language generation model, on a corpus of jokes to see whether an AI can be taught a sense of humour.

Skip to the last cells if you just want to see it crack some jokes.

## Importing the libraries

In [None]:
!pip install transformers

In [21]:
# Only the names used in this notebook are imported.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)
from transformers.optimization import Adafactor, AdafactorSchedule

## Setting the parameters of the model 

In [22]:
model_name = "distilgpt2"  # checkpoint fine-tuned below (a distilled GPT-2)
train_path = "/content/drive/MyDrive/Colab Notebooks/Data/jokes.txt"  # plain-text training corpus
output_dir = "./gpt2-distil"  # output directory where the model is saved
num_train_epochs = 5  # number of training epochs
batch_size = 32  # per-device batch size
eval_steps = 400  # number of update steps between two evaluations
save_steps = 800  # number of update steps between two checkpoint saves
warmup_steps = 500  # number of warmup steps for the learning rate scheduler
lr = 5e-5  # learning rate
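
To see how the step-based settings above relate to one another, it can help to estimate how many optimizer updates a run will perform. The sketch below is only illustrative: `total_update_steps` is a hypothetical helper, `num_blocks` is a made-up dataset size (the notebook does not report the real one), and the calculation assumes a single device with no gradient accumulation.

```python
import math

def total_update_steps(num_blocks, batch_size, num_train_epochs):
    """Estimate optimizer updates, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * num_train_epochs

# Hypothetical dataset size; the real value would be len(train_dataset).
num_blocks = 20_000
total = total_update_steps(num_blocks, batch_size=32, num_train_epochs=5)
print(total)         # 625 steps per epoch * 5 epochs = 3125
print(total // 800)  # number of checkpoints written with save_steps=800: 3
```

If the estimate comes out smaller than `warmup_steps` or `save_steps`, those settings would never be reached during the run.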

## Loading and pre-processing the data 

In [23]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

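def load_dataset(train_path, tokenizer):
    """Build the training dataset and the collator for causal language modeling."""
    # Tokenize the whole file and cut it into blocks of 128 tokens.
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128,
    )

    # mlm=False selects causal (next-token) language modeling instead of masked LM.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    return train_dataset, data_collator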

In [24]:
train_dataset, data_collator = load_dataset(train_path, tokenizer)
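
`TextDataset` treats the file as one long token sequence and cuts it into consecutive blocks of `block_size` tokens, dropping the incomplete tail. The following is a simplified sketch of that chunking, not the library's own code; `chunk_token_ids` is a hypothetical helper and the integer ids are made up.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into full blocks, dropping the short tail."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

token_ids = list(range(10))  # made-up ids standing in for a tokenized file
blocks = chunk_token_ids(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]; ids 8 and 9 are dropped
```

One consequence of this scheme is that block boundaries ignore joke boundaries, so a training example may begin or end in the middle of a joke.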




In [31]:
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_steps=eval_steps,  # only takes effect when step-based evaluation and an eval dataset are configured
    save_steps=save_steps,
    warmup_steps=warmup_steps,  # used by the Trainer's default scheduler; a custom scheduler is passed below
    prediction_loss_only=True,
    logging_steps=100,
)

optimizer = Adafactor(
    model.parameters(),
    lr=lr,
    eps=(1e-30, 5e-5),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)

lr_scheduler = AdafactorSchedule(optimizer)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),
)



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


## Training the model and building the jokes pipeline

In [None]:
trainer.train()

In [52]:
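def cracks_a_joke(prompt):
  """Generate a continuation of `prompt` and print it."""
  # Keep the input on the same device as the model (the Trainer may have moved it to a GPU).
  input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
  # Default decoding, capped at 50 tokens including the prompt.
  greedy_output = model.generate(input_ids, max_length=50)
  print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))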
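
With only `max_length` set, `generate` is expected to fall back to greedy decoding (hence the name `greedy_output`): at every step it appends the single most likely next token. The toy sketch below illustrates the idea under strong simplifying assumptions. The score table `NEXT_TOKEN_SCORES` and the helper `greedy_decode` are invented for illustration, and a real model scores the next token from the whole prefix rather than from the last token alone.

```python
# Toy next-token scores (made up); a real model computes these from the full prefix.
NEXT_TOKEN_SCORES = {
    "knock": {"knock": 0.2, "who's": 0.8},
    "who's": {"there?": 0.9, "knock": 0.1},
    "there?": {"the": 0.7, "knock": 0.3},
    "the": {"guy": 0.6, "there?": 0.4},
    "guy": {"who?": 0.9, "the": 0.1},
    "who?": {"the": 0.8, "knock": 0.2},
}

def greedy_decode(prompt_tokens, max_length):
    """Repeatedly append the highest-scoring next token until max_length is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        tokens.append(max(scores, key=scores.get))
    return tokens

print(" ".join(greedy_decode(["knock"], max_length=10)))
# knock who's there? the guy who? the guy who? the
```

Because the choice at each step is deterministic, the same prompt always yields the same continuation, and once the most likely path enters a cycle the output repeats until the length cap is hit.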

## Let's hear it! 😆

In [50]:
cracks_a_joke('A priest enters a pub')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A priest enters a pub. The priest says, "Hey, what's this, some kind of joke?" 


In [54]:
cracks_a_joke('What is')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the difference between a garbanzo bean and a chickpea? I wouldn't pay $200 to have a garbanzo bean on my face. 


In [59]:
cracks_a_joke('What do you call')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What do you call a dog with no legs? A shih tzu. 


In [91]:
cracks_a_joke('What do you call')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What do you call a group of people who are in a relationship with each other? A group of people who are in a relationship with each other. 


In [90]:
cracks_a_joke('When is')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


When is the best time to go to the dentist? Tooth hurty. 


In [84]:
cracks_a_joke('knock')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


knock Knock knock knock. Who's there? The guy who? The guy who? The guy who? The guy who? The guy who? The guy who? The guy who? The guy who? The guy who? The guy who?


In [73]:
cracks_a_joke("Who's")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Who's there? The guy who's in the bathroom. 
