
# GPT-Neo 1.3B

**Model Description:**
GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model.

**Training data:**
GPT-Neo 1.3B was trained on the Pile, a large-scale curated dataset created by EleutherAI for the purpose of training this model.

**Training procedure:**
This model was trained on the Pile for 380 billion tokens over 362,000 steps. It was trained as a masked autoregressive language model (each position attends only to the tokens before it), using cross-entropy loss.
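
As a rough illustration of the cross-entropy objective mentioned above, the sketch below computes the mean negative log-likelihood of the correct next tokens. The function name `next_token_cross_entropy` and the toy probabilities are assumptions made for this example; they are not taken from the actual training code.

```python
import math

def next_token_cross_entropy(predicted_probs, target_ids):
    """Mean negative log-likelihood of the true next token at each position."""
    losses = [-math.log(probs[target]) for probs, target in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical predicted distributions over a 3-token vocabulary at two positions
predicted_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
# Index of the token that actually came next at each position
target_ids = [0, 1]

loss = next_token_cross_entropy(predicted_probs, target_ids)
print(f"cross-entropy loss: {loss:.4f}")
```

The loss shrinks as the model assigns more probability to the token that actually follows, which is what training on next-token prediction optimizes.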



The text-generation process explored in this notebook works as follows:

1. **Token Probability Distribution**: After processing the initial text, the model generates a probability distribution over the entire vocabulary for the next token. This distribution is typically obtained by passing the model's output (logits) through a softmax function, which converts the logits into probabilities.

2. **Sampling Strategy**: Once the probability distribution is obtained, there are several strategies for selecting the next token:

    a. **Greedy Decoding (equivalent to top_k = 1)**: This strategy simply selects the token with the highest probability as the next token. It is quick and deterministic but can lead to repetitive or less diverse outputs.

    b. **Top-k Sampling (top_k > 1)**: In this approach, the model considers only the top-k tokens with the highest probabilities and samples from this reduced set. This allows for some level of diversity while still maintaining control over the selection process.

    c. **Temperature Scaling**: Another technique involves using a temperature parameter during sampling. This parameter controls the steepness of the softmax function, affecting the diversity of the generated text. Higher temperatures lead to more randomness in the sampling process, while lower temperatures make the sampling more deterministic.

    d. **Random Sampling**: This strategy involves sampling directly from the entire probability distribution, where each token's probability serves as its likelihood of being chosen. This approach allows for the greatest level of diversity in the generated text.

3. **Optimal Next Word**: The optimal next word is subjective and depends on the specific task and desired outcomes. While greedy sampling might produce coherent text quickly, it could lack diversity. On the other hand, random sampling could lead to more creative and diverse outputs but might sacrifice coherence.

4. **Balancing Exploration and Exploitation**: Choosing the appropriate sampling strategy involves balancing exploration (seeking new, diverse outputs) and exploitation (leveraging known high-probability tokens for coherent text generation). The choice often depends on the application and desired trade-offs between novelty and coherence in the generated text.

In summary, the model's ability to find the optimal next word depends on the sampling strategy employed, which balances exploration and exploitation to generate diverse yet coherent text.
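
The steps above can be sketched on a toy example using only the Python standard library. The four-word vocabulary and the logits are made up for illustration, and `softmax` here is a hand-written helper rather than the function the model uses internally.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities, after dividing by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits, for illustration only
vocab = ["on", "in", "with", "at"]
logits = [3.0, 1.5, 1.0, 0.2]

probs = softmax(logits)                    # step 1: probability distribution
sharp = softmax(logits, temperature=0.6)   # lower temperature: more peaked
flat = softmax(logits, temperature=2.0)    # higher temperature: flatter

# Greedy decoding: always take the most probable token
greedy_token = vocab[probs.index(max(probs))]

# Random sampling: draw from the full distribution (seeded for repeatability)
rng = random.Random(0)
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

print("greedy:", greedy_token, "| sampled:", sampled_token)
print("top probability at T=0.6, 1.0, 2.0:", sharp[0], probs[0], flat[0])
```

Greedy decoding returns the same token on every run, while random sampling can return any token in proportion to its probability; the temperature controls how strongly the distribution favors the top token.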

In [1]:
import pandas as pd

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Check if GPU is available, set device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define the model name
model_name = "EleutherAI/gpt-neo-1.3B"

# Load the tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model from the pretrained model
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)




# Inspecting the Top-k Next-Token Candidates

In [3]:
starting_prompt = "I am working"
# Encode the input text; return_tensors="pt" returns the token IDs as a PyTorch tensor.
input_tokens = tokenizer.encode(starting_prompt, return_tensors="pt").to(device)
iterations = []
num_generation_steps = 50
num_choices_per_step = 100
# Generate text iteratively
with torch.no_grad():
    for step in range(num_generation_steps):

        # Create a dictionary for current iteration data
        current_iteration = {"Input": tokenizer.decode(input_tokens[0])}

        # Generate token probabilities and sort for best choices
        model_output = model(input_ids=input_tokens)
        next_token_probabilities = torch.softmax(model_output.logits[0, -1, :], dim=-1)
        top_token_ids = torch.argsort(next_token_probabilities, descending=True)

        # Store top choices with probabilities
        for choice_index in range(num_choices_per_step):
            token_id = top_token_ids[choice_index].item()
            token_probability = next_token_probabilities[token_id].item()

            token_choice = f"{tokenizer.decode(token_id)} ({100 * token_probability:.2f}%)"
            current_iteration[f"Choice {choice_index+1}"] = token_choice

        # Append the most likely token (the greedy choice) to the input, as a (1, 1) tensor
        input_tokens = torch.cat([input_tokens, top_token_ids[:1].unsqueeze(0)], dim=-1)

        # Add iteration data to list
        iterations.append(current_iteration)
# Create a DataFrame for clear presentation
output_df = pd.DataFrame(iterations)
print(output_df)



                                                Input             Choice 1  \
0                                        I am working          on (67.52%)   
1                                     I am working on           a (49.95%)   
2                                   I am working on a     project (12.61%)   
3                           I am working on a project        that (26.01%)   
4                      I am working on a project that    requires (15.65%)   
5             I am working on a project that requires           a (14.90%)   
6           I am working on a project that requires a          lot (5.85%)   
7       I am working on a project that requires a lot          of (97.73%)   
8    I am working on a project that requires a lot of         data (3.91%)   
9   I am working on a project that requires a lot ...          to (19.17%)   
10  I am working on a project that requires a lot ...          be (86.13%)   
11  I am working on a project that requires a lot ...      store

In [4]:
# Encode the starting prompt once and reuse both the input IDs and the attention mask
encoded = tokenizer(starting_prompt, return_tensors="pt").to(device)
input_ids = encoded.input_ids # Token IDs as a PyTorch tensor
attention_mask = encoded.attention_mask # Attention mask for the input

# Generate text using the language model
output = model.generate(
    input_ids=input_ids, # Provide input tensors
    attention_mask=attention_mask, # Provide attention mask tensors
    max_new_tokens=num_generation_steps, # Maximum number of tokens to generate
    do_sample=False # Disable sampling to get deterministic output
)

# Decode and print the generated text
print(tokenizer.decode(output[0])) # Decode the generated tokens and print the resulting text


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I am working on a project that requires a lot of data to be stored in a database. I am using the Entity Framework to store the data. I have a class called User that has a property called UserName. I am using the UserName property to store


# Greedy Search

In [None]:
max_length = 120
# Define the input text
input_txt = """The ancient oracle foretold a hero rising from the ashes, but \
only the whispers of forgotten magic offer clues. \
Where does the hero's journey begin?\n\n
"""

# Encode the input text into input tensors
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device) # Encode input into PyTorch tensors

# Generate text using greedy decoding with a maximum length constraint
output_greedy = model.generate(
    input_ids, # Provide input tensors
    max_length=max_length, # Set the maximum length for the generated text
    do_sample=False, # Disable sampling to use greedy decoding
    pad_token_id=tokenizer.eos_token_id # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(output_greedy[0])) # Decode the generated tokens and print the resulting text


# Beam Search

In [6]:
# Generate text using beam search decoding with specified parameters
output_beam = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    num_beams=3,  # Set the number of beams for beam search decoding
    do_sample=False,  # Disable sampling to use beam search decoding
    no_repeat_ngram_size=2,  # Set the size of the n-grams to avoid repetition
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(output_beam[0]))  # Decode the generated tokens and print the resulting text


The ancient oracle foretold a hero rising from the ashes, but only the whispers of forgotten magic offer clues. Where does the hero's journey begin?



The hero’s journey is the journey of a man or a woman, a child or an adult, from one place to another. It is a journey that begins with a single step and ends in a new place. The hero is not the one who takes the first step. He or she is simply the person who has taken the next step, and so on, until they reach their destination.

<|endoftext|>


In [None]:
# Generate text using beam search decoding with specified parameters
output_beam = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    num_beams=5,  # Set the number of beams for beam search decoding
    do_sample=False,  # Disable sampling to use beam search decoding
    no_repeat_ngram_size=2,  # Set the size of the n-grams to avoid repetition
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(output_beam[0]))  # Decode the generated tokens and print the resulting text
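
To make the idea behind `num_beams` more concrete, here is a minimal beam-search sketch over a made-up toy model. `TOY_MODEL`, its probabilities, and the `beam_search` helper are all hypothetical; this is not how `model.generate` is implemented, and it ignores options such as `no_repeat_ngram_size`.

```python
import math

# Hypothetical toy "language model": the next-token probabilities depend
# only on the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"hero": 0.55, "oracle": 0.45},
    "a": {"hero": 0.9, "oracle": 0.1},
    "hero": {"</s>": 1.0},
    "oracle": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished sequences are carried over unchanged
                candidates.append((score, tokens))
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, max_steps=3)[0]
best = beam_search(num_beams=2, max_steps=3)[0]

print("1 beam :", " ".join(greedy[1]), f"(p = {math.exp(greedy[0]):.2f})")
print("2 beams:", " ".join(best[1]), f"(p = {math.exp(best[0]):.2f})")
```

With a single beam the search commits to "the" because it is the most likely first token, whereas two beams keep "a" alive long enough to discover that "a hero" has a higher overall probability.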


# Temperature

In [None]:
# Generate text using sampling with temperature scaling
output_temp = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    do_sample=True,  # Enable sampling to generate diverse outputs
    temperature=2.0,  # Set the temperature parameter for temperature scaling
    top_k=0,  # Set top_k to 0 to allow all tokens to be considered during sampling
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(output_temp[0]))  # Decode the generated tokens and print the resulting text


In [9]:
# Generate text using sampling with temperature scaling
output_temp = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    do_sample=True,  # Enable sampling to generate diverse outputs
    temperature=0.6,  # Set the temperature parameter for temperature scaling (lower temperature results in more conservative sampling)
    top_k=0,  # Set top_k to 0 to allow all tokens to be considered during sampling
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(output_temp[0]))  # Decode the generated tokens and print the resulting text


The ancient oracle foretold a hero rising from the ashes, but only the whispers of forgotten magic offer clues. Where does the hero's journey begin?



The Oracle of Delphi was the last great oracle of ancient Greece, and the first to warn of the coming of a new hero. But the Oracle never told of the hero, or his name. Now, the Oracle of Delphi is the last oracle of magic, and the first to tell of the hero’s journey.



The Oracle of Delphi first appeared in the 4th century BC,


# Top-k and Top-p

**Setting Top K**

In [10]:
# Generate text using sampling with top-k sampling
out_top_k = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    do_sample=True,  # Enable sampling to generate diverse outputs
    top_k=40,  # Set the top_k parameter for top-k sampling (select from top 40 tokens based on logits)
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(out_top_k[0]))  # Decode the generated tokens and print the resulting text


The ancient oracle foretold a hero rising from the ashes, but only the whispers of forgotten magic offer clues. Where does the hero's journey begin?



We've been talking a lot about how modern, western society is destroying this land since the early 21st century. But now it seems that we can only see it from the perspective of the future. The ancient oracle of Endor foretold of a hero rising from the ashes.



What happens when the heroes' path crosses with a world once created by magic? Perhaps when they are faced with an ancient prophecy and
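
The effect of `top_k` on a single sampling step can be sketched as follows. The helper `top_k_filter` and the toy probabilities are assumptions for illustration; the library applies the equivalent filtering to the logits internally.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
print(filtered)  # only the two most probable tokens can still be sampled
```

Sampling then proceeds from the filtered distribution, so unlikely tokens outside the top k can never be chosen.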


**Setting Top P**

In [11]:
# Generate text using sampling with nucleus sampling
out_topp = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    do_sample=True,  # Enable sampling to generate diverse outputs
    top_p=0.80,  # Set the top_p parameter for nucleus sampling (select from the smallest set of tokens whose cumulative probability exceeds p)
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(out_topp[0]))  # Decode the generated tokens and print the resulting text



The ancient oracle foretold a hero rising from the ashes, but only the whispers of forgotten magic offer clues. Where does the hero's journey begin?



The prophecy was about the hero.

A great army of heroes had marched out of the dark days of the world. They were a force to be reckoned with. They would destroy the forces of darkness and rise from the ashes of the earth. The army was strong, but the path to glory lay in the hands of the true heroes. The heroes of the ancient prophecy were not the men and women that we are today.
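
Nucleus (top-p) filtering for a single step can be sketched in a similar way. The helper `top_p_filter` and the toy probabilities are assumptions for illustration, not the library's own code.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary
probs = [0.4, 0.3, 0.2, 0.1]
filtered = top_p_filter(probs, p=0.8)
print(filtered)  # the least likely token is dropped, the rest are renormalized
```

Unlike top-k, the number of tokens kept varies from step to step: a peaked distribution keeps only a few candidates, while a flat one keeps many.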


**Setting Top P and Top K**

In [12]:
# Generate text using sampling with top-k and nucleus sampling
out_top_pk = model.generate(
    input_ids,  # Provide input tensors
    max_length=max_length,  # Set the maximum length for the generated text
    do_sample=True,  # Enable sampling to generate diverse outputs
    top_k=40,  # Set the top_k parameter for top-k sampling (select from top 40 tokens based on logits)
    top_p=0.80,  # Set the top_p parameter for nucleus sampling (select from the smallest set of tokens whose cumulative probability exceeds p)
    pad_token_id=tokenizer.eos_token_id  # Set the pad token ID to handle sequences shorter than max_length
)

# Decode and print the generated text
print(tokenizer.decode(out_top_pk[0]))  # Decode the generated tokens and print the resulting text


The ancient oracle foretold a hero rising from the ashes, but only the whispers of forgotten magic offer clues. Where does the hero's journey begin?



I have no words to describe the feeling I had when I saw that the great temple was not only a monument to the ancient world, but a museum to the gods themselves.

A monument to the ancient world

I'm sure most of us have a vague memory of our childhood trips to our local town's temple, and it was at the temple that we were taught about the gods, and the sacred history behind the
