# Next-Word Prediction Using GPT-2

## A Predict-Next-Word Example Using Hugging Face and GPT-2

- [Source 1](https://jamesmccaffrey.wordpress.com/2021/10/21/a-predict-next-word-example-using-hugging-face-and-gpt-2/)

### Imports

In [1]:
#! pip install torch
#! pip install transformers
#! pip install numpy 


### Next word prediction

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np

print("\nBegin next-word using HF GPT-2 demo ")

toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "Machine learning with PyTorch can do amazing"
print("\nInput sequence: ")
print(seq)

inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # just the IDs, no attention mask
print("\nToken IDs and their words: ")
for tok_id in inpt_ids[0]:  # avoid shadowing the built-in id()
  word = toker.decode(tok_id)
  print(tok_id, word)

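# Inference only: no gradient tracking; keep the logits at the last position
with torch.no_grad():
  logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

# Greedy choice: the token ID with the largest logit
pred_id = torch.argmax(logits).item()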
print("\nPredicted token ID of next word: ")
print(pred_id)

pred_word = toker.decode(pred_id)
print("\nPredicted next word for sequence: ")
print(pred_word)

print("\nEnd demo ")


Begin next-word using HF GPT-2 demo 




Input sequence: 
Machine learning with PyTorch can do amazing

Tokenized input data structure: 
{'input_ids': tensor([[37573,  4673,   351,  9485, 15884,   354,   460,   466,  4998]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Token IDs and their words: 
tensor(37573) Machine
tensor(4673)  learning
tensor(351)  with
tensor(9485)  Py
tensor(15884) Tor
tensor(354) ch
tensor(460)  can
tensor(466)  do
tensor(4998)  amazing

All logits for next word: 
tensor([[-114.9652, -118.0908, -123.3014,  ..., -124.5989, -127.7998,
         -118.4347]])
torch.Size([1, 50257])

Predicted token ID of next word: 
1243

Predicted next word for sequence: 
 things

End demo 
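
The logits printed above are unnormalized scores, so their magnitudes are hard to interpret directly. A minimal sketch, assuming probabilities are wanted instead: apply a softmax over the vocabulary dimension. The example uses plain Python and a made-up four-entry logit list (the values and the helper name `softmax` are illustrative, not taken from the demo). With the real tensor, `torch.softmax(logits, dim=-1)` should give the equivalent result.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary (not the demo's real values).
toy_logits = [-114.97, -118.09, -110.25, -123.30]
probs = softmax(toy_logits)

# Greedy decoding picks the same index from logits or probabilities,
# because softmax preserves the ordering of the scores.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 4))
```

Because the ordering is preserved, the `argmax` step in the demo does not need the softmax. It only matters when the scores should be shown or compared as probabilities.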


## Top 10 predicted

In [8]:
print("\nBegin next-word using HF GPT-2 demo ")

# Load the GPT-2 model and tokenizer
toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "i am "
print("\nInput sequence: ")
print(seq)

# Tokenize the input sentence
inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # only the token IDs
print("\nToken IDs and their words: ")
for tok_id in inpt_ids[0]:  # avoid shadowing the built-in id()
    word = toker.decode(tok_id)
    print(tok_id, word)

# Compute the logits
with torch.no_grad():
    logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

# Select the 10 highest-scoring next tokens
top_k = 10
top_k_result = torch.topk(logits, top_k)  # raw logits in .values, token IDs in .indices (not probabilities)
top_k_ids = top_k_result.indices[0].tolist()

print(f"\nTop {top_k} predicted token IDs for the next word: ")
print(top_k_ids)

print(f"\nTop {top_k} predicted next words for the sequence: ")
top_k_words = [toker.decode(pred_id) for pred_id in top_k_ids]
print(top_k_words)

print("\nEnd demo ")




Begin next-word using HF GPT-2 demo 

Input sequence: 
i am 

Tokenized input data structure: 
{'input_ids': tensor([[ 72, 716, 220]]), 'attention_mask': tensor([[1, 1, 1]])}

Token IDs and their words: 
tensor(72) i
tensor(716)  am
tensor(220)  

All logits for next word: 
tensor([[-59.5626, -60.9692, -63.4266,  ..., -70.3573, -67.7269, -64.5714]])
torch.Size([1, 50257])

Top 10 predicted token IDs for the next word: 
[488, 1134, 425, 522, 1849, 576, 544, 2474, 29773, 528]

Top 10 predicted next words for the sequence: 
['ich', 'ik', 'ive', 'ike', '\xa0', 'ile', 'ia', '!"', '�', 'iz']

End demo 
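
The odd fragments in the top-10 list above (`'ich'`, `'ik'`, `'ive'`, ...) are likely a consequence of the trailing space in `"i am "`. GPT-2's tokenizer usually attaches a space to the start of the following word (note ` am` in the token list), so a trailing space becomes its own token (ID 220) and the model is asked what follows a bare space. A minimal sketch of a guard, assuming that stripping trailing whitespace is acceptable for the use case; `clean_prompt` is a hypothetical helper name, not part of `transformers`.

```python
def clean_prompt(prompt: str) -> str:
    """Drop trailing whitespace so the prompt ends on a real word token."""
    return prompt.rstrip()

seq = clean_prompt("i am ")
print(repr(seq))  # 'i am'
# The cleaned string would then be passed on as before,
# e.g. toker(seq, return_tensors="pt").
```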


## Top 10 predicted in German

In [None]:

print("\nBegin next-word prediction using German GPT-2 model")

# Load the German GPT-2 tokenizer and model
toker = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# Example input sequence in German
seq = "kannst du mir etwas "
print("\nInput sequence: ")
print(seq)

# Tokenize the text
inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

# Show the token IDs
inpt_ids = inpts["input_ids"]
print("\nToken IDs and their words: ")
for tok_id in inpt_ids[0]:  # avoid shadowing the built-in id()
    word = toker.decode(tok_id)
    print(tok_id, word)

# Predict the next token
with torch.no_grad():
    logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits.shape)

# Select the 10 highest-scoring next tokens
top_k = 10
top_k_result = torch.topk(logits, top_k)  # raw logits in .values, token IDs in .indices (not probabilities)
top_k_ids = top_k_result.indices[0].tolist()

print(f"\nTop {top_k} predicted token IDs for the next word: ")
print(top_k_ids)

# Decode the top-10 tokens
print(f"\nTop {top_k} predicted next words for the sequence: ")
top_k_words = [toker.decode(pred_id) for pred_id in top_k_ids]
print(top_k_words)



Begin next-word prediction using German GPT-2 model

Input sequence: 
ikannst du mir etwas 

Tokenized input data structure: 
{'input_ids': tensor([[ 424,  390,  268,  671,  943, 1410,  225]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

Token IDs and their words: 
tensor(424) ik
tensor(390) ann
tensor(268) st
tensor(671)  du
tensor(943)  mir
tensor(1410)  etwas
tensor(225)  

All logits for next word: 
torch.Size([1, 50265])

Top 10 predicted token IDs for the next word: 
[2462, 5664, 1353, 871, 824, 140, 5822, 16503, 2663, 4090]

Top 10 predicted next words for the sequence: 
['reiben', 'riech', 'ruck', 'öff', 'reis', '�', 'uff', 'räu', 'tiger', 'reibung']
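
Some entries in the decoded lists are not readable words. `'�'` typically appears when a single token covers only part of a multi-byte UTF-8 character, and entries such as `'\xa0'` (seen in the English run) are pure whitespace. A minimal sketch, assuming such candidates should simply be hidden when suggestions are displayed; `readable_candidates` is a hypothetical helper, and the input list is copied from the output above.

```python
REPLACEMENT_CHAR = "\ufffd"  # typically emitted when decoded bytes are not valid UTF-8

def readable_candidates(words):
    """Keep candidates that are non-empty after stripping and fully decodable."""
    return [w for w in words if w.strip() and REPLACEMENT_CHAR not in w]

# Decoded top-10 list from the German run above.
top_k_words = ['reiben', 'riech', 'ruck', 'öff', 'reis', '\ufffd',
               'uff', 'räu', 'tiger', 'reibung']
print(readable_candidates(top_k_words))
```

Filtering after `topk` can return fewer than ten entries. If exactly ten are needed, one option is to request a larger `top_k` and truncate after filtering.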
