# Language Modeling with a small GPT-2 Model
checked 27.02.24 GPaaß

This notebook uses code from [https://huggingface.co/gpt2](https://huggingface.co/gpt2)

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

The model has 124 million parameters and
* embeddings of length 768 for 50257 tokens
* embeddings of length 768 for 1024 positions
* 12 layers of `GPT2Block`, each with a self-attention `GPT2Attention` and a fully connected network `GPT2MLP`
* a final linear layer with 768 inputs and 50257 outputs
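
The figure of 124 million can be checked by hand from the sizes listed above. The following is a minimal sketch, not output from the notebook. It assumes the standard GPT-2 small configuration: a hidden width of 3072 inside `GPT2MLP`, layer normalization with a scale and a shift vector, bias terms on every linear layer, and a final linear layer that reuses the token embedding matrix. None of these details are stated in this notebook.

```python
# Sizes taken from the list above.
vocab, n_pos, d, n_layer = 50257, 1024, 768, 12
d_ff = 4 * d  # assumption: hidden width of GPT2MLP is 3072


def linear(n_in, n_out):
    """Weights plus bias of one fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * d  # assumption: one scale and one shift vector

embeddings = vocab * d + n_pos * d           # token + position embeddings
attention = linear(d, 3 * d) + linear(d, d)  # query/key/value projection + output projection
mlp = linear(d, d_ff) + linear(d_ff, d)      # the two layers of GPT2MLP
block = 2 * layer_norm + attention + mlp     # one GPT2Block

# Assumption: the final linear layer shares its weights with the token
# embeddings, so it contributes no additional parameters.
total = embeddings + n_layer * block + layer_norm  # last term: final layer norm
print(f"{total:,}")  # prints 124,439,808
```

Under these assumptions the token embeddings alone account for roughly 38.6 million parameters, almost a third of the model.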

This model was trained on WebText, a dataset built from the text of web pages reached via 45 million outbound links from Reddit.

In [None]:
generator.model

The generator performs the following actions:
* encode the input text as an integer vector using the tokenizer of the model
* apply the model function `generate`. You can pass additional parameters, for example:
  * `do_sample` (default False) Whether or not to use sampling.
  * `temperature` (default 1.0) The value used to modulate the next-token probabilities.
  * `top_k` (default 50) The number of highest-probability vocabulary tokens to keep for top-k filtering.
  * `top_p` (default 1.0) If set to a float < 1, only the smallest set of most probable tokens whose probabilities add up to `top_p` or higher is kept for generation.
  * `repetition_penalty` (default 1.0) The parameter for the repetition penalty. 1.0 means no penalty.
* decode the generated integers back into the tokens of the generated text
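
How `temperature`, `top_k` and `top_p` reshape the next-token distribution can be illustrated on a handful of made-up scores. This is a toy sketch in plain Python, not the library's implementation. The function name `next_token_distribution`, the convention that `top_k=0` disables the filter, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Toy temperature scaling followed by top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; values > 1 flatten
    # the distribution, values < 1 sharpen it.
    probs = softmax([x / temperature for x in logits])

    # Token ids ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most probable tokens (0 disables the filter here).
    if top_k > 0:
        order = order[:top_k]

    # top-p: keep the smallest prefix whose renormalized cumulative
    # probability reaches top_p.
    if top_p < 1.0:
        mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept

    # Renormalize over the surviving tokens; all others get probability 0.
    mass = sum(probs[i] for i in order)
    keep = set(order)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
print(next_token_distribution(logits))                   # unfiltered
print(next_token_distribution(logits, temperature=3.0))  # flatter
print(next_token_distribution(logits, top_k=2))          # only two tokens survive
print(next_token_distribution(logits, top_p=0.9))        # least probable token dropped
```

With `do_sample=True`, the next token would then be drawn at random from the returned distribution.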

In [None]:
generator.model.generate??
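
The `repetition_penalty` parameter listed above can be pictured as a rescaling of the scores of tokens that have already been generated. The sketch below follows the rule proposed for the CTRL model: positive scores are divided by the penalty and negative scores are multiplied by it, so a penalty above 1 always makes a repeat less likely. Whether the installed library version applies exactly this rule is an assumption here. The helper name `apply_repetition_penalty` and the example values are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that already occur in the generated text."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a positive score or multiplying a negative one by a
        # penalty > 1 both reduce the score.
        out[i] = out[i] * penalty if out[i] < 0 else out[i] / penalty
    return out


# Tokens 0 and 1 were already generated; token 2 was not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
# prints [1.0, -2.0, 0.5]
```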

In [None]:
# only produce the next token with the highest probability (greedy decoding)
res=generator("When Donald Trump entered the room,", max_length=100, num_return_sequences=1, do_sample=False)
print(res[0]['generated_text'])

In [None]:
# sample the next token according to the computed probabilities; each run generates a new text
res=generator("When Donald Trump entered the room,", max_length=100, num_return_sequences=3, do_sample=True)
for i in range(len(res)):
  print(50*'-')
  print(res[i]['generated_text'])
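
The difference between `do_sample=False` and `do_sample=True` can be reproduced without a neural network. The sketch below replaces GPT-2 with a hypothetical lookup table `NEXT` of next-word probabilities. The table, the function `generate_toy` and its arguments are invented for illustration and are not part of `transformers`.

```python
import random

# Hypothetical next-word probabilities, standing in for the model output.
NEXT = {
    "the": {"room": 0.6, "crowd": 0.4},
    "room": {"fell": 0.7, "was": 0.3},
    "crowd": {"fell": 0.6, "was": 0.4},
    "fell": {"silent": 0.8, "the": 0.2},
    "was": {"silent": 0.6, "the": 0.4},
    "silent": {"the": 1.0},
}


def generate_toy(start, steps, do_sample, rng):
    """Extend `start` by `steps` words, greedily or by sampling."""
    tokens = [start]
    for _ in range(steps):
        dist = NEXT[tokens[-1]]
        if do_sample:
            # Draw the next word at random, weighted by its probability.
            words, weights = zip(*dist.items())
            tokens.append(rng.choices(words, weights=weights)[0])
        else:
            # Greedy decoding: always take the most probable word.
            tokens.append(max(dist, key=dist.get))
    return tokens


rng = random.Random(42)
print(generate_toy("the", 4, do_sample=False, rng=rng))
# prints ['the', 'room', 'fell', 'silent', 'the']
for _ in range(3):
    print(generate_toy("the", 4, do_sample=True, rng=rng))
```

Greedy decoding returns the same sequence on every call, while the sampled sequences can differ from call to call.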

In [None]:
# the next token is sampled from the most probable tokens that together cover the top 90% of the probability mass
res=generator("When Donald Trump entered the room,", max_length=100, num_return_sequences=3, do_sample=True, top_p=0.9)
for i in range(len(res)):
  print(50*'-')
  print(res[i]['generated_text'])

In [None]:
# the probability estimates are "flattened" by a temperature > 1; this only has an effect when sampling
res=generator("When Donald Trump entered the room,", max_length=100, num_return_sequences=3, do_sample=True, temperature=3.0)
for i in range(len(res)):
  print(50*'-')
  print(res[i]['generated_text'])

In [None]:
start=["The black hole started to glow",
       "When Donald Trump entered the room",
       "Angela Merkel went to Washington",
       "Admiral Nelson"
      ]

In [None]:
for st in start:
    res = generator(st, max_length=100, num_return_sequences=3, do_sample=True)
    print("="*100)
    for r in res:
        print("-"*50)
        print(r['generated_text'],"\n")

## GPT-Neo 1.3B

In [None]:
# reset the CUDA device to release GPU memory still held by earlier cells
from numba import cuda
device = cuda.get_current_device()
device.reset()

In [None]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")



In [None]:
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

Recall: running on 6 CPUs.

In [None]:
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=500,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text,"\n")

## GPT-J

Loading the model needs 48 GB of RAM.
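
A back-of-the-envelope calculation shows where a figure of this size can come from. The sketch assumes roughly 6 billion parameters stored as 32-bit floats, and that two full copies of the weights are held in memory while loading (the freshly initialized model plus the checkpoint being read). These are assumptions for illustration, not measurements from this notebook, and `load_ram_gb` is a hypothetical helper.

```python
def load_ram_gb(n_params, bytes_per_param=4, copies=2):
    """Rough RAM estimate in GB (10**9 bytes) for loading a checkpoint."""
    return n_params * bytes_per_param * copies / 1e9


print(load_ram_gb(6_000_000_000))                     # float32, two copies: 48.0
print(load_ram_gb(6_000_000_000, bytes_per_param=2))  # 16-bit weights: 24.0
```

By the same estimate, storing the weights in 16 bits would halve the requirement.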

In [None]:
# reset the CUDA device to release GPU memory still held by the previous model
from numba import cuda
device = cuda.get_current_device()
device.reset()

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")



In [None]:
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids


Generation works only on the CPU.

In [None]:
%%time
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=200,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]

print(gen_text,"\n")