## Text Generation using GPT2 Model

The goal is to generate an advertisement from a product description.

In [3]:
! pip install transformers datasets accelerate

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [4]:
# Import libraries
import torch
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
from datasets import Dataset

In [6]:
model_name = "gpt2-large"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [7]:
def generate_advertisement(product_description, max_length=100):
    input_text = "Product: " + product_description + "\nAdvertisement:"

    # Tokenize: encode the input text into token ids
    input_ids = tokenizer.encode(input_text, return_tensors="tf")

    # Generate text; max_length counts the prompt tokens as well as the new ones
    output = model.generate(input_ids, max_length=max_length)

    # Decode the ids back into text
    generated_ads = []
    for sample in output:
        generated_ad = tokenizer.decode(sample, skip_special_tokens=True)
        generated_ads.append(generated_ad)

    return generated_ads

In [11]:
product_description = "Introducing our latest vehicle, that has the latest engine and fuel capacity"

generated_ads = generate_advertisement(product_description, max_length=150)

In [12]:
generated_ads

['Product: Introducing our latest vehicle, that has the latest engine and fuel capacity\nAdvertisement: Content continues below...\n\nThe new car is called the "E-Hybrid," and it\'s a hybrid of the Toyota Prius and the Honda Fit. It\'s a hybrid that\'s supposed to be more fuel efficient than the Prius, but it\'s also supposed to be more fuel efficient than the Fit.\n\nThe E-Hybrid is a hybrid of the Toyota Prius and the Honda Fit. It\'s a hybrid that\'s supposed to be more fuel efficient than the Prius, but it\'s also supposed to be more fuel efficient than the Fit.\n\nThe E-Hybrid is a hybrid of the Toyota Prius and']
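
The generated text above repeats the same sentences several times. As a rough way to quantify that, the sketch below counts word-level n-grams that occur more than once. The helper `repeated_ngrams` is hypothetical and not part of the original notebook, and it works on whitespace-separated words rather than model tokens, so it is only an approximation of what the model sees.

```python
from collections import Counter


def repeated_ngrams(text, n=3):
    """Return the word n-grams that occur more than once in `text`, with counts."""
    words = text.split()
    # Slide a window of n words over the text.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {" ".join(gram): count for gram, count in counts.items() if count > 1}


# A sentence taken from the generated output, repeated as it is there.
sample = (
    "The E-Hybrid is a hybrid of the Toyota Prius and the Honda Fit. "
    "The E-Hybrid is a hybrid of the Toyota Prius and the Honda Fit."
)
repeats = repeated_ngrams(sample, n=3)
print(len(repeats), "distinct 3-grams occur more than once")
```

Passing `generated_ads[0]` instead of `sample` would apply the same check to the full generated advertisement.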

## 2 Using Greedy approach

In this approach, the token with the highest probability is chosen as the next token at every step.
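
To make the selection rule concrete, here is a minimal sketch of greedy decoding over a toy next-token table. Everything in it is an assumption for illustration: `TOY_MODEL`, its made-up probabilities, and the `greedy_decode` helper are hypothetical. Unlike GPT-2, the toy model conditions only on the previous token instead of the whole prefix.

```python
# Hypothetical toy "language model": maps the previous token to a probability
# distribution over the next token. The probabilities are made up.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"car": 0.5, "engine": 0.3, "<end>": 0.2},
    "a": {"car": 0.7, "<end>": 0.3},
    "car": {"<end>": 0.9, "the": 0.1},
    "engine": {"<end>": 1.0},
}


def greedy_decode(model, start="<start>", end="<end>", max_length=10):
    """Pick the single most probable next token at every step."""
    tokens = []
    current = start
    for _ in range(max_length):
        next_probs = model[current]
        # Greedy step: take the argmax of the next-token distribution.
        current = max(next_probs, key=next_probs.get)
        if current == end:
            break
        tokens.append(current)
    return tokens


print(greedy_decode(TOY_MODEL))  # prints ['the', 'car']
```

Because only the locally best token is kept at each step, a sequence that starts with a less likely token but has a higher overall probability is never explored.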

In [13]:
def generate_advertisement_greedy(product_description):
    input_text = "Product: " + product_description + "\nAdvertisement:"

    # Encode the input text into token ids. Beam-search options (num_beams,
    # no_repeat_ngram_size, num_return_sequences, early_stopping) are arguments
    # of model.generate, not tokenizer.encode, so they are not passed here.
    input_ids = tokenizer.encode(input_text, return_tensors="tf")

    # Generate text greedily: a single beam and no sampling
    output = model.generate(input_ids, max_length=150, num_beams=1, do_sample=False)

    # Decode the ids back into text
    generated_ads = []
    for sample in output:
        generated_ad = tokenizer.decode(sample, skip_special_tokens=True)
        generated_ads.append(generated_ad)

    return generated_ads

In [14]:
generated_ads_greedy = generate_advertisement_greedy(product_description)


In [15]:
generated_ads_greedy

['Product: Introducing our latest vehicle, that has the latest engine and fuel capacity\nAdvertisement: Content continues below...\n\nThe new car is called the "E-Hybrid," and it\'s a hybrid of the Toyota Prius and the Honda Fit. It\'s a hybrid that\'s supposed to be more fuel efficient than the Prius, but it\'s also supposed to be more fuel efficient than the Fit.\n\nThe E-Hybrid is a hybrid of the Toyota Prius and the Honda Fit. It\'s a hybrid that\'s supposed to be more fuel efficient than the Prius, but it\'s also supposed to be more fuel efficient than the Fit.\n\nThe E-Hybrid is a hybrid of the Toyota Prius and']
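
Greedy decoding tends to loop, as the repeated sentences in the output above show. Beam search with an n-gram repetition constraint is a common alternative, and in `transformers` those options are keyword arguments of `model.generate`, not of the tokenizer. The following is a minimal sketch, assuming the `model` and `tokenizer` objects loaded earlier. The function name `generate_advertisement_beam` is hypothetical, and the default values are illustrative rather than tuned.

```python
def generate_advertisement_beam(model, tokenizer, product_description,
                                max_length=150, num_beams=7,
                                no_repeat_ngram_size=3, num_return_sequences=5):
    """Beam-search variant: decoding options are passed to model.generate()."""
    input_text = "Product: " + product_description + "\nAdvertisement:"

    # The tokenizer only converts text to token ids; it takes no decoding options.
    input_ids = tokenizer.encode(input_text, return_tensors="tf")

    # num_return_sequences must not exceed num_beams.
    output = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,                        # keep this many candidates per step
        no_repeat_ngram_size=no_repeat_ngram_size,  # forbid repeating any token n-gram of this size
        num_return_sequences=num_return_sequences,  # return the top candidates
        early_stopping=True,                        # stop once enough beams have finished
    )

    # Decode every returned sequence back into text.
    return [tokenizer.decode(sample, skip_special_tokens=True) for sample in output]
```

It could be called as `generate_advertisement_beam(model, tokenizer, product_description)`, which returns a list of `num_return_sequences` candidate advertisements instead of a single one.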