---

In [None]:
# Presented on 25 October 2024 by Ahmed Baari

## Generating text using Large Language Models
In this notebook, we use GPT-2 to generate text. GPT-2 is a large language model developed by OpenAI and based on the transformer architecture. Its largest variant has 1.5 billion parameters; the `gpt2` checkpoint loaded in this notebook is the smallest variant, with roughly 124 million parameters. The model was trained on a large corpus of web text and can generate coherent, contextually relevant continuations of a prompt.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pretrained GPT-2 tokenizer and model ("gpt2" is the smallest checkpoint)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")



In [2]:
input_sequence = "I love to study at SASTRA becuase"

In [3]:
input_ids = tokenizer(input_sequence, return_tensors="pt").input_ids
input_ids

tensor([[   40,  1842,   284,  2050,   379,   311, 11262,  3861,   639,    84,
           589]])
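
The IDs above can be mapped back to their token strings to see how the prompt was split. Seven words became eleven tokens, so some words (including the misspelled "becuase") were broken into several subword pieces. The following is a minimal sketch: `show_tokens` is a hypothetical helper, not part of `transformers`, and it assumes `tokenizer` and `input_ids` are the objects created in the cells above.

```python
def show_tokens(tokenizer, ids):
    """Pair each token ID with the text it decodes to.

    Hypothetical helper for illustration; it relies only on `tokenizer.decode`.
    """
    return [(int(i), tokenizer.decode([int(i)])) for i in ids]

# Usage in the notebook (assumes the cells above have been run):
# show_tokens(tokenizer, input_ids[0])
```

Each tuple holds one ID and its text, which makes it easy to spot where a rare or misspelled word was fragmented.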

In [4]:
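# Generate a continuation of the prompt with beam search.
# top_k, top_p and temperature only take effect when do_sample=True;
# with the default do_sample=False they are ignored by this call.
output = model.generate(input_ids,
                        max_length=50,           # limit counts prompt tokens plus new tokens
                        num_return_sequences=1,
                        num_beams=5,             # keep the 5 best partial sequences at each step
                        no_repeat_ngram_size=2,  # no pair of consecutive tokens may occur twice
                        top_k=50,
                        top_p=0.95,
                        temperature=1)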

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
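
These warnings appear because only `input_ids` was passed to `generate`. GPT-2 defines no padding token, so a common workaround is to reuse the end-of-sequence token as the pad token and to pass the attention mask that the tokenizer already returns. The sketch below shows one way to do this; `generate_with_mask` is a hypothetical wrapper, and it assumes `model` and `tokenizer` are the objects loaded earlier.

```python
def generate_with_mask(model, tokenizer, prompt, **generate_kwargs):
    """Call `model.generate` with an explicit attention mask and pad token.

    Hypothetical helper: it only forwards arguments that `generate` accepts.
    """
    encoded = tokenizer(prompt, return_tensors="pt")
    return model.generate(
        encoded.input_ids,
        attention_mask=encoded.attention_mask,  # marks which positions are real tokens
        pad_token_id=tokenizer.eos_token_id,    # reuse EOS as the pad token
        **generate_kwargs,
    )

# Usage in the notebook (assumes `model`, `tokenizer`, `input_sequence` exist):
# output = generate_with_mask(model, tokenizer, input_sequence,
#                             max_length=50, num_beams=5, no_repeat_ngram_size=2)
```

For a single unpadded prompt the mask is all ones, so the generated text should not change; the explicit arguments matter mainly when several prompts of different lengths are batched together.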


In [5]:
# Wrap long output lines at 100 characters so they fit in the cell
import textwrap
wrapper = textwrap.TextWrapper(width=100)
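
As a quick check of what the wrapper does, the standalone sketch below wraps a made-up sample string. It also shows a default behavior worth knowing when reading generated text: `TextWrapper` replaces newlines in its input with spaces, so paragraph breaks produced by the model show up as extra spaces rather than line breaks.

```python
import textwrap

wrapper = textwrap.TextWrapper(width=100)

long_line = "word " * 60            # made-up sample text, 300 characters
wrapped = wrapper.fill(long_line)

# Every wrapped line fits within the configured width.
assert all(len(line) <= 100 for line in wrapped.splitlines())

# Newlines in the input are replaced by spaces (replace_whitespace=True by default).
two_paragraphs = wrapper.fill("a\n\nb")
print(repr(two_paragraphs))         # 'a  b'
```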


In [6]:
# Print each generated sequence, wrapped with the TextWrapper defined above
print("Output:")

for i, sample_output in enumerate(output):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    print(wrapper.fill(f"{i}: {text}"))

Output:
0: I love to study at SASTRA becuase it's a great place for me to do that.  I've been looking for a
place to go to for years now and I've never been able to find one. So I decided
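
The `no_repeat_ngram_size=2` argument in the `generate` call prevents any pair of consecutive tokens from occurring twice in the output, which is one reason the text above does not loop on a single phrase. The rule can be sketched in plain Python as follows. This is an illustration of the idea only, not the library's implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for start in range(len(tokens) - n + 1):
        ngram = tuple(tokens[start:start + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = ["I", "like", "tea", "and", "I"]
print(banned_next_tokens(history, 2))  # {'like'}
```

Here "I like" has already occurred, so "like" may not follow the final "I". During generation, the scores of such banned tokens are suppressed before the next token is chosen.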
