<a href="https://colab.research.google.com/github/AslauAlexandru/Prodigy-InfoTech-Generative-AI-Internship-Tasks/blob/main/Prodigy_InfoTech_Generative_AI_Internship_Task_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task-01

## Text Generation with GPT-2

Train a model to generate coherent and contextually relevant text based on a given prompt. Starting with GPT-2, a transformer model developed by OpenAI, you will learn how to fine-tune the model on a custom dataset to create text that mimics the style and structure of your training data.



In [1]:
!pip install transformers
!pip install accelerate



# Text Generation with GPT-2 and GPT-2 Fine-Tuning

In [2]:
# Food.com - Recipes and Reviews dataset link:
# https://www.kaggle.com/datasets/irkaal/foodcom-recipes-and-reviews

!kaggle datasets download -d irkaal/foodcom-recipes-and-reviews
!unzip /content/foodcom-recipes-and-reviews.zip
!rm /content/foodcom-recipes-and-reviews.zip

Dataset URL: https://www.kaggle.com/datasets/irkaal/foodcom-recipes-and-reviews
License(s): CC0-1.0
Downloading foodcom-recipes-and-reviews.zip to /content
100% 723M/723M [00:19<00:00, 38.0MB/s]
Archive:  /content/foodcom-recipes-and-reviews.zip
  inflating: recipes.csv             
  inflating: recipes.parquet         
  inflating: reviews.csv             
  inflating: reviews.parquet         


In [3]:
import pandas as pd

import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler

from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
from transformers import get_linear_schedule_with_warmup

from tqdm.auto import tqdm
import random
import datetime
import time

# GPT-2 Fine-Tuning

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"device {device}")

model_name = "gpt2"  # options: ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']
model_save_path = './model'

device cuda


In [5]:
configuration = GPT2Config.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=configuration)

tokenizer = GPT2TokenizerFast.from_pretrained(model_name)

input_sequence = "beef, salt, pepper"
input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

model = model.to(device)
# Combine top-k and top-p (nucleus) sampling
sample_outputs = model.generate(
                              input_ids.to(device),
                              do_sample = True,
                              max_length = 120,
                              top_k = 50,
                              top_p = 0.85,
                              num_return_sequences = 3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('  ---')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Output:
----------------------------------------------------------------------------------------------------
0: beef, salt, pepper)

2 medium mushrooms, cut into cubes

3 garlic cloves, minced (I used the garlic), minced (I used the sautéed cloves)

1 tsp minced fresh thyme (or any other sweetener)

1 tsp ground nutmeg

3-4 tbsp chopped fresh rosemary leaves, divided

1/2 cup cooked and boiled cilantro (or some dried basil leaves if you're making this dish)

1-2 tbsp chopped fresh cilantro, or you can substitute a little or a large portion...
  ---
1: beef, salt, pepper, garlic, and parsley.

And then I'm back on the patio with my kids.

It's about time for a little picnic on the side of the street. And I'd like to start the day with a big potluck.

I'm sure my kids might be hungry as well, but it's time for a picnic, so that they can catch a few of my favorites and drink some beer.

One of the coolest things about my trip is that, although I don't know how much it's been like for me..
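
The `generate` call above passes both `top_k` and `top_p`. As a rough illustration of how the two filters interact, here is a minimal pure-Python sketch on a toy distribution. The function name and the probabilities are made up for the example, and the library's own implementation works on logits and may differ in details.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.85):
    """Return a renormalised {token: prob} dict after top-k, then top-p filtering."""
    # Top-k step: keep the top_k most likely tokens and renormalise them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(token, p / k_total) for token, p in ranked]

    # Top-p step: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    p_total = sum(p for _, p in kept)
    return {token: p / p_total for token, p in kept}

# Toy next-token distribution (illustrative numbers, not model output).
toy_probs = {"garlic": 0.5, "onion": 0.25, "thyme": 0.125, "picnic": 0.0625, "beer": 0.0625}
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.85)

# Sample the next token from what survived both filters.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, next_token)
```

Unlikely continuations such as "picnic" are removed before sampling, which is the intent behind combining the two settings.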

In [6]:
df_recipes = pd.read_csv('/content/recipes.csv')
df_recipes.reset_index(drop=True, inplace=True)

df_recipes = df_recipes[["RecipeId", "Name", "RecipeIngredientParts", "RecipeInstructions"]].iloc[:1000]
print(list(df_recipes.columns))
print(f"data shape {df_recipes.shape}")

['RecipeId', 'Name', 'RecipeIngredientParts', 'RecipeInstructions']
data shape (1000, 4)


In [7]:
import nltk
nltk.download('punkt')
import numpy as np

doc_lengths = []

for rec in df_recipes.itertuples():

    # get rough token count distribution
    tokens = nltk.word_tokenize(rec.RecipeIngredientParts + ' ' + rec.RecipeInstructions)

    doc_lengths.append(len(tokens))

doc_lengths = np.array(doc_lengths)
# the max token length
print(f"% documents > 180 tokens: {round(len(doc_lengths[doc_lengths > 180]) / len(doc_lengths) * 100, 1)}%")
print(f"Average document length: {int(np.average(doc_lengths))}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


% documents > 180 tokens: 42.7%
Average document length: 186
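
With 42.7% of documents above 180 tokens, a large share of recipes would be truncated at that cutoff. One way to pick a cutoff is to ask what length covers a chosen share of the documents. The sketch below does that on made-up lengths; the helper names are hypothetical, and the counts here are word-level, so GPT-2 BPE token counts would differ.

```python
import math

def share_longer_than(lengths, cutoff):
    """Fraction of documents whose length exceeds `cutoff`."""
    return sum(1 for n in lengths if n > cutoff) / len(lengths)

def cutoff_for_coverage(lengths, coverage_pct=95):
    """Smallest length that fits at least `coverage_pct` percent of documents untruncated."""
    ordered = sorted(lengths)
    index = math.ceil(coverage_pct * len(ordered) / 100) - 1
    return ordered[max(index, 0)]

# Made-up document lengths for illustration only.
toy_lengths = [40, 75, 120, 150, 180, 186, 210, 260, 340, 520]

print(share_longer_than(toy_lengths, 180))
print(cutoff_for_coverage(toy_lengths, 90))
```

A longer cutoff costs more memory per batch, so the choice is a trade-off rather than a fixed rule.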


In [8]:
def form_string(ingredient,instruction):
    # s = f"<|startoftext|>Ingredients:\n{ingredient.strip()}\n\nInstructions:\n{instruction.strip()}<|endoftext|>"
    s = f"<|startoftext|>Ingredients: {ingredient.strip()}. " \
        f"Instructions: {instruction.strip()}<|endoftext|>"
    return s

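def extract_string(recipe):
    # Strip the special tokens, then split on the "Instructions: " marker.
    text = recipe.replace('<|startoftext|>', '').replace('<|endoftext|>', '')
    inst_pos = text.find('Instructions: ')
    # form_string() joins the two parts with ". ", so drop those two characters.
    ingredients = text[len('Ingredients: '): inst_pos - 2]
    instructions = text[inst_pos + len('Instructions: '):]
    return ingredients, instructions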

data = df_recipes.apply(lambda x:form_string(
    x['RecipeIngredientParts'], x['RecipeInstructions']), axis=1).to_list()
data[0]

'<|startoftext|>Ingredients: c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice"). Instructions: c("Toss 2 cups berries with sugar.", "Let stand for 45 minutes, stirring occasionally.", "Transfer berry-sugar mixture to food processor.", "Add yogurt and process until smooth.", "Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers\' directions). Freeze uncovered until edges are solid but centre is soft.  Transfer to processor and blend until smooth again.", "Return to pan and freeze until edges are solid.", "Transfer to processor and blend until smooth again.", \n"Fold in remaining 2 cups of blueberries.", "Pour into plastic mold and freeze overnight. Let soften slightly to serve.")<|endoftext|>'
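
The sample above shows that the dataset stores lists as R-style `c("...", "...")` strings, so the model is trained on that markup as well. If plain text is preferred, the fields could be unpacked first. The following is a sketch under that assumption; `parse_r_vector` and `form_clean_string` are hypothetical helpers, not part of the notebook, and the fallback for unquoted fields is a guess at an edge case.

```python
import re

def parse_r_vector(field):
    """Extract the quoted items from an R-style c("a", "b") string."""
    items = re.findall(r'"((?:[^"\\]|\\.)*)"', field)
    # Fall back to the raw string when nothing is quoted (assumed edge case).
    return items if items else [field.strip()]

def form_clean_string(ingredients, instructions):
    """Same template as form_string(), but built from the unpacked fields."""
    ing = ", ".join(parse_r_vector(ingredients))
    ins = " ".join(parse_r_vector(instructions))
    return f"<|startoftext|>Ingredients: {ing}. Instructions: {ins}<|endoftext|>"

example = form_clean_string(
    'c("blueberries", "granulated sugar")',
    'c("Toss 2 cups berries with sugar.", "Let stand for 45 minutes.")',
)
print(example)
```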

In [9]:
tokenizer = GPT2TokenizerFast.from_pretrained(model_name,
                                              bos_token='<|startoftext|>',
                                              eos_token='<|endoftext|>',
                                              unk_token='<|unknown|>',
                                              pad_token='<|pad|>'
                                             )



In [10]:
vocab_list = sorted(tokenizer.vocab.items(), key=lambda x:x[1])
for i in range(5555, 5566):
    print(vocab_list[i])

('ĠPhoto', 5555)
('Ġplus', 5556)
('rick', 5557)
('arks', 5558)
('Ġalternative', 5559)
('Ġpil', 5560)
('Ġapprox', 5561)
('that', 5562)
('Ġobjects', 5563)
('ĠRo', 5564)
('ĠAndroid', 5565)


In [11]:
print("The max model length is {} for this model".format(tokenizer.model_max_length))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The unknown token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.unk_token_id), tokenizer.unk_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

The max model length is 1024 for this model
The end of sequence token <|endoftext|> has the id 50256
The beginning of sequence token <|startoftext|> has the id 50257
The unknown token <|unknown|> has the id 50258
The padding token <|pad|> has the id 50259


In [12]:
# GPT-2 is a large model. Increasing the batch size above 2 has led to out-of-memory problems.
batch_size = 2
max_length = 180  # maximum sequence length in tokens

# Standard PyTorch approach: wrap the data in a Dataset class.
class RecipeDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.input_ids = []
        self.attn_masks = []
        self.origin_ingredients = []
        self.origin_instructions = []

        for recipe in data:
            encodings = tokenizer.encode_plus(recipe,
                                              truncation=True,
                                              padding='max_length',
                                              max_length=max_length,
                                              return_tensors='pt'       # return PyTorch tensor
                                             )
            self.input_ids.append(torch.squeeze(encodings['input_ids'],0))
            # attention_mask tells model not to incorporate these PAD tokens into its interpretation of the sentence
            self.attn_masks.append(torch.squeeze(encodings['attention_mask'],0))
            ingredients, instructions = extract_string(recipe)
            self.origin_ingredients.append(ingredients)
            self.origin_instructions.append(instructions)


    def __len__(self):
        return len(self.data)

    def __getitem__(self,idx):
        return self.input_ids[idx], self.attn_masks[idx], self.origin_ingredients[idx], self.origin_instructions[idx]
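
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a minimal pure-Python sketch of what happens to one sequence. `pad_or_truncate` is a hypothetical stand-in for the tokenizer's behaviour, assuming its default right-side truncation and padding. The special-token ids are the ones printed earlier in the notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Right-truncate or right-pad `token_ids` to exactly `max_length` items."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

PAD, EOS, BOS = 50259, 50256, 50257

short = [BOS, 11, 12, EOS]
long = [BOS] + list(range(100, 110)) + [EOS]

ids, mask = pad_or_truncate(short, 6, PAD)
print(ids, mask)

ids_long, mask_long = pad_or_truncate(long, 6, PAD)
print(EOS in ids_long)  # the end-of-text token is cut off for long inputs
```

Under this assumption, recipes longer than `max_length` lose their trailing `<|endoftext|>`, which may make it harder for the model to learn where a recipe ends.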

In [13]:
dataset = RecipeDataset(data, tokenizer)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

  900 training samples
  100 validation samples


In [14]:
print(f"dataset size {len(dataset)}")
print(f"dataset[0]: \n  input_ids: {dataset[0][0]}\n  attn_masks: {dataset[0][1]}")

dataset size 1000
dataset[0]: 
  input_ids: tensor([50257, 41222,    25,   269,  7203, 17585, 20853,  1600,   366, 46324,
         4817,  7543,  1600,   366, 10438,  5049, 32132,  1600,   366,   293,
         2144, 13135, 11074, 27759,    25,   269,  7203,    51,   793,   362,
        14180, 36322,   351,  7543, 33283,   366,  5756,  1302,   329,  4153,
         2431,    11, 26547, 10491, 33283,   366, 43260,   275,  6996,    12,
           82, 35652, 11710,   284,  2057, 12649, 33283,   366,  4550, 32132,
          290,  1429,  1566,  7209, 33283,   366,  1273,  3201,   832,  3734,
          264, 12311,    13, 39128,   656, 16871,  3425,   357,   273,  4351,
          284,  4771,  8566, 16009,   290,  1429,  1864,   284, 11372,     6,
        11678,   737, 34917, 18838,  1566, 13015,   389,  4735,   475,  7372,
          318,  2705,    13,   220, 20558,   284, 12649,   290, 13516,  1566,
         7209,   757, 33283,   366, 13615,   284,  3425,   290, 16611,  1566,
        13015,   389

In [15]:
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

# Fine-Tune the GPT-2 Language Model

In [16]:
configuration = GPT2Config.from_pretrained(model_name, output_hidden_states=False)
model = GPT2LMHeadModel.from_pretrained(model_name, config=configuration)
model = model.to(device)
print(f"Weight shape {model.transformer.wte.weight.shape}")
# this step is necessary because I've added some tokens (bos_token, etc.) to the embeddings
# otherwise the tokenizer and model tensors won't match up
model.resize_token_embeddings(len(tokenizer))
print(f"Number of tokens: {len(tokenizer)}")

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

Weight shape torch.Size([50257, 768])
Number of tokens: 50260


In [17]:
word_embeddings = model.transformer.wte.weight # Word Token Embeddings

print(word_embeddings.shape)

torch.Size([50260, 768])


In [18]:
epochs = 3
learning_rate = 2e-5
warmup_steps = 100  # number of scheduler steps, kept as an integer
# eps is a very small number added to the denominator of the update to prevent division by zero
epsilon = 1e-8
# optim = Adam(model.parameters(), lr=5e-5)
optim = AdamW(model.parameters(), lr = learning_rate, eps = epsilon)

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

In [19]:
# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optim,
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps)
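
The shape of this schedule is easy to check by hand. The sketch below re-implements the learning-rate multiplier in plain Python, using this notebook's values (450 batches per epoch, 3 epochs, 100 warmup steps). It is a simplified stand-in written for illustration, not the library code itself.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 450 * 3   # batches per epoch x epochs
warmup_steps = 100
base_lr = 2e-5

for step in (0, 50, 100, 725, 1350):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The learning rate peaks at the end of warmup and reaches zero on the last training step.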

In [20]:
def infer(prompt):
    # Prefix the prompt the same way the training strings begin.
    text = f"<|startoftext|>Ingredients: {prompt.strip()}"
    encoded = tokenizer(text, return_tensors="pt")
    input_ids      = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]

    output = model.generate(input_ids.to(device),
                            attention_mask=attention_mask.to(device),
                            max_new_tokens=max_length,
                            # temperature = 0.5,
                            do_sample = True, top_k = 50, top_p = 0.85)
                            # num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    return output

In [21]:
total_t0 = time.time()

training_stats = []

for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()  # `train` just changes the *mode* (train vs. eval), it doesn't *perform* the training.

    for step, batch in enumerate(train_dataloader):     # step from enumerate() = number of batches

        b_input_ids = batch[0].to(device)   # tokens (of multiple documents in a batch)
        b_labels    = batch[0].to(device)
        b_masks     = batch[1].to(device)   # mask of [1] for a real word, [0] for a pad

        model.zero_grad()
        # loss = model(X.to(device), attention_mask=a.to(device), labels=X.to(device)).loss
        outputs = model(  input_ids = b_input_ids,
                          labels = b_labels,
                          attention_mask = b_masks,
                          token_type_ids = None
                        )

        loss = outputs[0]

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Print the loss and a generated sample every 100 batches.
        if step % 100 == 0 and step != 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader), batch_loss, elapsed))

            model.eval()

            sample_output = infer("eggs, flour, butter, sugar")
            print(sample_output)

            # `train` just changes the *mode* (train vs. eval), it doesn't *perform* the training.
            model.train()

        loss.backward()
        optim.step()
        scheduler.step()

    # Calculate the average loss over all the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))


    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        with torch.no_grad():

            outputs  = model(input_ids = b_input_ids,
                             attention_mask = b_masks,
                             labels = b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   100  of    450. Loss: 3.2267932891845703.   Elapsed: 0:00:16.
Ingredients: eggs, flour, butter, sugar, brown sugar, and saltMix togetherIn oven, butter, sugar, and salt; butter, until butter melts and the cream is softened,Remove from heat; andHeat in pans,Scream together; andSpray with coconut oil,Scream together;andCook, covered incook,For a few minutes, until softened,AndCook, covered incook,AndCook until the butter has melted,but the cream is melted,andthe mixtureAdd in, andAdd in,and,andScout andCook,Remove from heat;Cook,Bake,Covered inCarb mixture,Spread mixture inCarbo,Mix all inCook,Preheat ovenTo theCook,Heat,In a pan,Heat2,Curry,Poil,Bread,Curry,Curry,CombWithCoil andMixedIn a saucepan,Foil,


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   200  of    450. Loss: 3.2050235271453857.   Elapsed: 0:00:33.
Ingredients: eggs, flour, butter, sugar, baking powder, baking soda, cinnamon, and salt, salt and pepper; and heavy cream in a mixer; vanilla, molasses, vanilla bean powder, and baking soda; cream butter, powdered sugar, cinnamon, and salt; and beat until light and fluffy; stir in cream; cream cheese, softened butter, and baking powder; cream cheese mixture; and beat until well blended; stir in remaining ingredients; mixture in remaining cream; butter; and blend on high speed until creamy; blend in egg, vanilla, vanilla bean powder, and baking soda; blend in remaining ingredients; mixture in mixture; and beat until creamy; blend in cream; blend in cream; butter; and blend in top of bowl; mixture in butter; and beat until smooth; remove from heat; add remaining eggplant; set aside; refrigerate for 2 hours; spray remaining tablespoonfuls with remaining milk and cool


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   300  of    450. Loss: 2.3739562034606934.   Elapsed: 0:00:50.
Ingredients: eggs, flour, butter, sugar, brown sugar, baking soda, vanilla, and salt: 3/4 cup milk, melted butter and cocoa mixture.Stir in salt, egg mixture and milk. Mix thoroughly and stir in remaining ingredients.Cover with plastic wrap and chill at room temperature.Transfer to plastic wrap and chill at room for up hour until ready to use:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   400  of    450. Loss: 2.195024251937866.   Elapsed: 0:01:04.
Ingredients: eggs, flour, butter, sugar and waterPour the wet ingredients into a lightly greased 8-inch pie pan and bake for 10 minutes, or until a toothpick inserted into the center comes out clean, stirring occasionally; then remove from the pan and cool completely.Stir in flour and stir in the flour mixture until the mixture is incorporated. Add the remaining ingredients and cook for 2-3 minutes, or until the mixture has reduced in color, stirring constantly. Drain the dough and allow to cool to room temperature before cutting into slices.

  Average training loss: 6.40
  Training epoch took: 0:01:13

Running Validation...
  Validation Loss: 2.11
  Validation took: 0:00:02

Training...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   100  of    450. Loss: 1.8072260618209839.   Elapsed: 0:00:15.
Ingredients: eggs, flour, butter, sugar, vanilla extract, and saltPour the butter and sugar over the egg yolks and beat until well blended. Pour the batter into a 8x8x8-inch baking dish or baking dish and Bake for 35-45 minutes or until golden brown, but not brown or have brown spots.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   200  of    450. Loss: 2.3154172897338867.   Elapsed: 0:00:30.
Ingredients: eggs, flour, butter, sugar, brown sugar, vanilla, salt, pepper and cinnamon, salt and pepper


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   300  of    450. Loss: 2.5706303119659424.   Elapsed: 0:00:45.
Ingredients: eggs, flour, butter, sugar, vanilla extract and vanilla extract (not sugar-free or margarine) Instructions: c("Slice into small pieces or cut into large pieces - place in a muffin tin.")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   400  of    450. Loss: 2.086430788040161.   Elapsed: 0:00:59.
Ingredients: eggs, flour, butter, sugar, salt, pepper, cayenne pepper, pepper, pepper jack cheese, paprika, oregano, cayenne pepper; cumin, cumin seeds, cumin flakes; cumin powder; cumin seeds; cumin oil; cumin, cumin meal; cumin seed mixture; cumin seeds; cumin oil; cumin blend; cumin mixture; cumin seed mixture; cumin oil; cumin powder; cumin, paprika, cumin seeds, cumin oil; cumin powder; cumin powder; cumin powder; cumin powder; cumin powder; cumin oil; cumin powder; cumin blend; cumin powder; cumin oil; cumin powder; cumin blend; cumin powder; cumin powder; cumin blend; cumin powder; cumin powder; cumin powder; cumin

  Average training loss: 2.19
  Training epoch took: 0:01:09

Running Validation...
  Validation Loss: 1.99
  Validation took: 0:00:02

Training...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   100  of    450. Loss: 1.098281741142273.   Elapsed: 0:00:14.
Ingredients: eggs, flour, butter, sugar and cayenne pepper; cayenne salt to taste; salt and pepper to taste; cayenne pepper powder to taste; brown sugar, to taste; cayenne; 1/2 teaspoon vanilla extract; pepper; cayenne; 1 cup water; cayenne pepper, paprika, paprika powder, cayenne pepper; cayenne; 2 tablespoons water; cayenne pepper, cumin, chili powder; cayenne; 1/4 cup cilantro; cilantro leaves; cumin powder; cayenne; 1/4 teaspoon cumin; cayenne; 1 tablespoon paprika; 1 tablespoon water; cumin; 1 teaspoon garlic powder; 1 tablespoon sugar; cayenne; cumin; 2 tablespoons vegetable oil; cumin; 1 teaspoon salt; cayenne; 1 tablespoon paprika; 1 tablespoon chili powder; cay


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   200  of    450. Loss: 2.8963727951049805.   Elapsed: 0:00:30.
Ingredients: eggs, flour, butter, sugar, and vanilla extract


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   300  of    450. Loss: 1.722977638244629.   Elapsed: 0:00:44.
Ingredients: eggs, flour, butter, sugar, salt, vanilla extract, salt, and pepper to taste Instructions:In a mixing bowl, combine eggs, flour, butter, sugar, salt, vanilla extract, and salt; blend until fluffy, scraping down sides. Mix well and refrigerate at least 4-5 hours.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  Batch   400  of    450. Loss: 2.0226526260375977.   Elapsed: 0:00:59.
Ingredients: eggs, flour, butter, sugar, salt, vanilla extract, salt and pepper to taste, cumin, lemon zest, paprika and nutmeg. Instructions: c("Place 1 cup of batter in greased 6x8 baking pan and let cool before adding flour; flip pan over; add in flour mixture and mix well until combined. Pour batter into greased 4x8 baking pan and cook until browned and set aside. In a separate bowl, mix egg mixture and flour.", "Cut off flours. Heat 2 tablespoons of oil in greased 6x8 pan, and mix well.", "Sprinkle with lemon zest.", "Preheat oven to 350 degreesF.", "Grease 9x8 baking pan with greased 6x8 baking pan. Spread batter on top.", "Press 2 teaspoon of batter evenly onto the top of each cookie.", "Drain batter.", "Bake at 350 degrees

  Average training loss: 2.10
  Training epoch took: 0:01:09

Running Validation...
  Validation Loss: 1.96
  Validation took: 0:00:02

Training complete!
Total training took 0:03:36 (h:mm:ss)
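
In the loop above, `b_labels` is a copy of `b_input_ids`, so padded positions also count toward the loss. A common alternative is to set the labels of padded positions to `-100`, the value the Hugging Face loss ignores. The sketch below shows the idea on plain lists; `mask_pad_labels` is a hypothetical helper, and whether this changes results for this dataset has not been tested here.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_pad_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]

# Toy sequence: <|startoftext|>, two tokens, <|endoftext|>, then two <|pad|> tokens.
ids = [50257, 41222, 25, 50256, 50259, 50259]
mask = [1, 1, 1, 1, 0, 0]

labels = mask_pad_labels(ids, mask)
print(labels)
```

On tensors, the same effect could be had with `b_input_ids.masked_fill(b_masks == 0, -100)` in place of the plain copy.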

In [22]:
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')
df_stats

| epoch | Training Loss | Valid. Loss | Training Time | Validation Time |
|-------|---------------|-------------|---------------|-----------------|
| 1     | 6.396286      | 2.105686    | 0:01:13       | 0:00:02         |
| 2     | 2.194378      | 1.991207    | 0:01:09       | 0:00:02         |
| 3     | 2.102946      | 1.961232    | 0:01:09       | 0:00:02         |
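
Cross-entropy losses like these are often reported as perplexity, which is the exponential of the loss. The sketch below converts the validation losses from the table; `perplexity` is a helper defined only for this example. Because the notebook's loss also covers padded positions, the numbers are a rough guide rather than a clean per-token perplexity.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) for a natural-log cross-entropy loss."""
    return math.exp(cross_entropy_loss)

# Validation losses per epoch, copied from the table above.
val_losses = {1: 2.105686, 2: 1.991207, 3: 1.961232}

for epoch, loss in val_losses.items():
    print(f"epoch {epoch}: perplexity = {perplexity(loss):.2f}")
```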


In [23]:
print("Saving model to %s" % model_save_path)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
# model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

Saving model to ./model


('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.json',
 './model/merges.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

In [24]:
model = GPT2LMHeadModel.from_pretrained(model_save_path)
tokenizer = GPT2TokenizerFast.from_pretrained(model_save_path)
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50260, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50260, bias=False)
)

# Generate Text

In [25]:
# model = GPT2LMHeadModel.from_pretrained(model_save_path)
# tokenizer = GPT2TokenizerFast.from_pretrained(model_save_path)
# model.to(device)
print(infer("eggs, mushroom, butter, sugar"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ingredients: eggs, mushroom, butter, sugar, cayenne pepper, salt, pepper, vanilla extract, lemon juice and cumin; 2 teaspoons vanilla extract


In [26]:
infer("onion, garlic, chicken breast")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Ingredients: onion, garlic, chicken breast, red bell pepper, cumin, turmeric, oregano, cayenne pepper, pepper flakes, curry powder, cumin leaves, ginger, thyme, salt, turmeric leaves, oregano, cumin oil, parsley, cilantro, oregano, salt, pepper, chili powder, cumin powder, oregano paste, red pepper flakes, cumin oil, red pepper flakes, cumin oil and chopped parsley to taste'

In [27]:
print(infer("avocado, lime"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ingredients: avocado, lime juice, coconut oil, lime juice, cayenne pepper, pepper flakes, salt, pepper, parsley, cloves, cinnamon, paprika, thyme, cayenne pepper, parsley seeds, cayenne pepper, cayenne pepper, cayenne salt, cumin, cayenne pepper, parsley, cumin powder, cloves, ginger, cloves, cumin seed, and garlic powder)


In [28]:
print(infer("beef, salt, pepper"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ingredients: beef, salt, pepper, cumin, ginger, garlic powder, cumin, ginger oil, ginger, cloves, and lemon juice; cumin, ginger, lime juice, and paprika; cumin leaves; cumin oregano; coriander seeds; coriander seeds; cumin, thyme, paprika, oregano; coriander; thyme, sage, oregano; thyme, lemon juice; cumin; thyme, thyme, oregano; thyme, thyme, sage, basil; cumin seeds; cumin, parsley, thyme, parsley powder; coriander, parsley; thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thyme, thy
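
The samples above keep extending the ingredient list and rarely reach an `Instructions:` part. One possible factor is that `infer` builds a prompt like `Ingredients: beef, salt, pepper`, while the training strings look like `Ingredients: c("...", "..."). Instructions: c(...)`. The sketch below builds a prompt in the training format instead. `build_prompt` is a hypothetical helper, and whether it improves the output is an untested assumption.

```python
def build_prompt(ingredients):
    """Format a list of ingredients like the training strings, up to 'Instructions:'."""
    quoted = ", ".join(f'"{item.strip()}"' for item in ingredients)
    return f"<|startoftext|>Ingredients: c({quoted}). Instructions:"

prompt_text = build_prompt(["beef", "salt", "pepper"])
print(prompt_text)
```

To try it, this string would replace the prefix that `infer` currently builds internally before tokenizing.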


# Text Generation with GPT-2 without Fine-Tuning

In [29]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
# Model name
model_name = "gpt2"
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create a model for causal language modeling
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, attn_implementation="sdpa").to(device)



In [30]:
# Define the prompt
prompt = "Hello, how are you today?"

# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

print(type(input_ids))
print(torch.Tensor)
print(input_ids)

<class 'torch.Tensor'>
<class 'torch.Tensor'>
tensor([[15496,    11,   703,   389,   345,  1909,    30]], device='cuda:0')


In [31]:
# Generate text
output = model.generate(input_ids=input_ids)

print(output)
print(type(output))


# Decode and print the generated text
generated_text = tokenizer.decode(output[0])

print(generated_text)

# Generate text
output = model.generate(input_ids=input_ids ,
                        max_length=100,
                        num_return_sequences=1)
#print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[15496,    11,   703,   389,   345,  1909,    30,   198,   198,    40,
          1101,   523,  3772,   284,   307,   994,    13,   314,  1101,   523]],
       device='cuda:0')
<class 'torch.Tensor'>
Hello, how are you today?

I'm so happy to be here. I'm so
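
The warning above asks for an `attention_mask`. A minimal sketch of what that mask encodes, in plain Python with no model involved: 1 marks a real token and 0 marks padding. The helper `left_pad` is hypothetical and not part of `transformers`. The first sequence reuses the token IDs printed earlier, and the second is just its first two tokens. Padding with the EOS id 50256 mirrors the "Setting `pad_token_id` to `eos_token_id`" message.

```python
def left_pad(batch, pad_token_id):
    """Pad every sequence on the left to a common length and build the mask."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_token_id] * n_pad + ids)
        # 0 = padding the model should ignore, 1 = real token
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


EOS_TOKEN_ID = 50256  # GPT-2 has no dedicated pad token, so EOS is reused
batch = [[15496, 11, 703, 389, 345, 1909, 30], [15496, 11]]
input_ids, attention_mask = left_pad(batch, EOS_TOKEN_ID)

print(attention_mask[0])  # [1, 1, 1, 1, 1, 1, 1]
print(attention_mask[1])  # [0, 0, 0, 0, 0, 1, 1]
```

For a single unpadded prompt the mask is all ones, which is why the output above is still sensible. In the library itself, calling the tokenizer directly (`tokenizer(prompt, return_tensors="pt")`) returns both `input_ids` and `attention_mask`, and the later cells pass them together via `**model_inputs`.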


In [32]:
# Generate text; max_length counts the prompt tokens as well as the new ones
output = model.generate(input_ids=input_ids,
                        max_length=100,
                        num_return_sequences=1)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0],
                                  skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you today?

I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so happy to be here. I'm so


In [33]:
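# Generate text
# Note: do_sample is not set, so decoding is still greedy. temperature and top_p
# only take effect with do_sample=True; here only repetition_penalty changes the output.
output = model.generate(input_ids=input_ids,
                        max_length=100,
                        num_return_sequences=1,
                        temperature=1.0,
                        repetition_penalty=2.0,
                        top_p=0.9)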

# Decode and print the generated text
generated_text = tokenizer.decode(output[0],
                                  skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you today?
I'm so happy to be here. I've been working on this project for a while now and it's finally finished! It was really fun doing the first part of my story but then when we started writing our second chapter in January 2015 (and that is before Christmas), things got very complicated because there were no other characters or stories available at all… So after about two months without any new material coming out from me as well as some work being done by
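
The repeated "I'm so happy to be here." loop disappears once `repetition_penalty=2.0` is set. A rough sketch of the idea, assuming the usual rule of lowering the score of every token that has already been generated (divide positive scores by the penalty, multiply negative ones). The function name and the four-token toy vocabulary are made up for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Dividing a positive score and multiplying a negative one both lower it
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Toy vocabulary of 4 tokens; tokens 0 and 2 have already been generated
logits = [4.0, 3.0, -1.0, 0.5]
print(apply_repetition_penalty(logits, generated_ids=[0, 2, 0], penalty=2.0))
# [2.0, 3.0, -2.0, 0.5]
```

Token 0 was the greedy choice before the penalty (4.0) and loses to token 1 (3.0) afterwards, which is how the penalty breaks a loop. A penalty of 1.0 leaves the scores unchanged.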


## Greedy Search

In [34]:
from accelerate import Accelerator
accelerator = Accelerator()

# Let Accelerate pick the device (the GPU if one is available)
device = accelerator.device


# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(device)
# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure
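
The output above falls into a loop, which is typical of greedy search. A toy sketch of why: greedy search always takes the single most probable next token, so once it revisits a state it repeats the same path. The probability table below is invented for illustration and conditions only on the previous token, whereas GPT-2 conditions on the whole context.

```python
# Hypothetical next-token probabilities P(next | current), for illustration only
NEXT_TOKEN_PROBS = {
    "I": {"am": 0.6, "walk": 0.4},
    "am": {"not": 0.7, "happy": 0.3},
    "not": {"sure": 0.9, "happy": 0.1},
    "sure": {"I": 0.8, ".": 0.2},
}


def greedy_decode(start, max_new_tokens):
    """Repeatedly append the most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode("I", 8)))
# I am not sure I am not sure I
```

The lower-probability branches ("walk", "happy", ".") are never explored, so the only way out of the loop is to change the decoding strategy, for example by sampling.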


## Top-K Sampling

In [35]:
# Let's see how Top-K sampling can be used in the library by setting top_k=50.
from transformers import set_seed

# Set the seed to reproduce results. Feel free to change it to get different results.
set_seed(42)

# Sample only from the 50 most likely next tokens
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to decide if I should start my dog on walks with my big sister.

My puppy was also a complete toss-
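
As a rough illustration of what `top_k` does at each step, the sketch below keeps the k most probable tokens, zeroes out the rest, renormalizes, and samples from what remains. The four-word vocabulary, its probabilities, and the helper `top_k_filter` are assumptions made for the example, not library code.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


vocab = ["walk", "park", "car", "the"]  # toy vocabulary
probs = [0.5, 0.25, 0.15, 0.1]          # toy next-token distribution

filtered = top_k_filter(probs, k=2)
print(filtered)  # only "walk" and "park" keep non-zero probability

# Sample 10 next tokens from the filtered distribution with a fixed seed
rng = random.Random(42)
draws = rng.choices(vocab, weights=filtered, k=10)
print(draws)  # never contains "car" or "the"
```

With `top_k=50` on GPT-2's vocabulary of roughly 50,000 tokens, the same filter removes the long tail of unlikely tokens while still leaving room for variety.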


## Top-p (nucleus) sampling

In [36]:
# We activate Top-p sampling by setting 0 < top_p < 1.

# Set the seed to reproduce results. Feel free to change it to get different results.
set_seed(42)

# Set top_p to 0.92 and deactivate Top-K sampling with top_k=0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which leads me to wondered how worried she is about letting me sleep in the
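
Unlike Top-K, Top-p keeps a variable number of tokens: the smallest set whose cumulative probability reaches `top_p`. The sketch below illustrates this on two made-up distributions. The helper `top_p_filter` is hypothetical, and the library's handling of ties and boundary cases may differ in detail.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def nucleus_size(probs, top_p):
    """Number of tokens that survive the filter."""
    return sum(1 for p in top_p_filter(probs, top_p) if p > 0)


peaked = [0.9, 0.05, 0.03, 0.02]  # the model is confident about the next token
flat = [0.25, 0.25, 0.25, 0.25]   # the model is unsure

print(nucleus_size(peaked, top_p=0.8))  # 1
print(nucleus_size(flat, top_p=0.8))    # 4
```

The candidate set shrinks when the model is confident and grows when it is not, which a fixed `top_k` cannot do.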


In [37]:
# Finally, to get multiple independently sampled outputs,
# we can set the parameter num_return_sequences > 1.

# Set the seed to reproduce results. Feel free to change it to get different results.
set_seed(42)

# Set top_k=50, top_p=0.95 and num_return_sequences=3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print(f"{i}: {tokenizer.decode(sample_output, skip_special_tokens=True)}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to figure out what to do with it. When I finally looked at this one, I knew I had something to be excited
1: I enjoy walking with my cute dog. He has this weird sense of smell. I like walking with him, especially when it comes to walking indoors or if it's a bit chilly. He does walk a lot of walking in our house
2: I enjoy walking with my cute dog.

You can follow a few of my dogs in their training and on their adventures.

Some dogs are very self taught. These are good for training your pet to thrive.
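
The cells above call `set_seed(42)` before each sampling run so that the results can be reproduced. A small stand-alone sketch of the same idea with Python's `random` module: a fixed seed makes the sampled sequences repeatable, and several sequences can be drawn from the same distribution. The vocabulary, weights, and function name are invented for the example.

```python
import random

VOCAB = ["dog", "walk", "park", "home"]  # toy vocabulary
WEIGHTS = [0.4, 0.3, 0.2, 0.1]           # toy next-token probabilities


def sample_sequences(seed, num_return_sequences, length):
    """Draw several token sequences from one seeded random generator."""
    rng = random.Random(seed)  # private generator, so the seed fixes every draw
    return [rng.choices(VOCAB, weights=WEIGHTS, k=length)
            for _ in range(num_return_sequences)]


first = sample_sequences(seed=42, num_return_sequences=3, length=5)
second = sample_sequences(seed=42, num_return_sequences=3, length=5)

print(first == second)  # True: the same seed reproduces the same samples
for i, tokens in enumerate(first):
    print(f"{i}: {' '.join(tokens)}")
```

Changing the seed gives a different set of samples, which matches the invitation in the cell comments to try other seeds.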


