# Next Word Prediction Using GPT-2

## A Predict-Next-Word Example Using Hugging Face and GPT-2

-  [Source 1](https://jamesmccaffrey.wordpress.com/2021/10/21/a-predict-next-word-example-using-hugging-face-and-gpt-2/)

### Imports

In [2]:
#! pip install torch
#! pip install transformers
#! pip install numpy 



### Next word prediction

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print("\nBegin next-word using HF GPT-2 demo ")

toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "Machine learning with PyTorch can do amazing"
print("\nInput sequence: ")
print(seq)

inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # token IDs only, without the attention mask
print("\nToken IDs and their words: ")
for token_id in inpt_ids[0]:  # named token_id to avoid shadowing the built-in id()
  word = toker.decode(token_id)
  print(token_id, word)

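# Run the model without gradient tracking and keep only the logits at the
# last position, which score every vocabulary entry as the next token.
with torch.no_grad():
  logits = model(**inpts).logits[:, -1, :]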
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

# Greedy choice: index of the largest logit along the vocabulary dimension
pred_id = torch.argmax(logits, dim=-1).item()
print("\nPredicted token ID of next word: ")
print(pred_id)

pred_word = toker.decode(pred_id)
print("\nPredicted next word for sequence: ")
print(pred_word)

print("\nEnd demo ")


Begin next-word using HF GPT-2 demo 

Input sequence: 
Machine learning with PyTorch can do amazing

Tokenized input data structure: 
{'input_ids': tensor([[37573,  4673,   351,  9485, 15884,   354,   460,   466,  4998]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Token IDs and their words: 
tensor(37573) Machine
tensor(4673)  learning
tensor(351)  with
tensor(9485)  Py
tensor(15884) Tor
tensor(354) ch
tensor(460)  can
tensor(466)  do
tensor(4998)  amazing

All logits for next word: 
tensor([[-114.9652, -118.0908, -123.3014,  ..., -124.5989, -127.7998,
         -118.4347]])
torch.Size([1, 50257])

Predicted token ID of next word: 
1243

Predicted next word for sequence: 
 things

End demo 
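
The logits printed above are unnormalized scores, not probabilities. As a minimal sketch in pure Python, using made-up logit values rather than real GPT-2 output, a softmax turns such scores into a probability distribution, and the token with the largest logit remains the most probable one. With PyTorch tensors, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary (illustrative values only)
toy_logits = [-114.9, -118.1, -110.2, -123.3]
toy_probs = softmax(toy_logits)

# The position of the largest logit is also the position of the largest probability
best_by_logit = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
best_by_prob = max(range(len(toy_probs)), key=lambda i: toy_probs[i])
print(toy_probs)
print(best_by_logit, best_by_prob)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly, as the cell above does, selects the same token as taking the argmax of the probabilities.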


## Top 10 predicted next words

In [7]:
print("\nBegin next-word using HF GPT-2 demo ")

# Load the GPT-2 model and tokenizer
toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "Machine learning with PyTorch can do amazing"
print("\nInput sequence: ")
print(seq)

# Tokenize the input sentence
inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # token IDs only
print("\nToken IDs and their words: ")
for token_id in inpt_ids[0]:
    word = toker.decode(token_id)
    print(token_id, word)

# Compute the logits for the position after the last input token
with torch.no_grad():
    logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

# Select the 10 highest-scoring tokens (topk returns raw logits, not probabilities)
top_k = 10
top_k_result = torch.topk(logits, top_k)
top_k_ids = top_k_result.indices[0].tolist()

print(f"\nTop {top_k} predicted token IDs for the next word: ")
print(top_k_ids)

print(f"\nTop {top_k} predicted next words for the sequence: ")
top_k_words = [toker.decode(pred_id) for pred_id in top_k_ids]
print(top_k_words)

print("\nEnd demo ")




Begin next-word using HF GPT-2 demo 

Input sequence: 
Machine learning with PyTorch can do amazing

Tokenized input data structure: 
{'input_ids': tensor([[37573,  4673,   351,  9485, 15884,   354,   460,   466,  4998]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Token IDs and their words: 
tensor(37573) Machine
tensor(4673)  learning
tensor(351)  with
tensor(9485)  Py
tensor(15884) Tor
tensor(354) ch
tensor(460)  can
tensor(466)  do
tensor(4998)  amazing

All logits for next word: 
tensor([[-114.9652, -118.0908, -123.3014,  ..., -124.5989, -127.7998,
         -118.4347]])
torch.Size([1, 50257])

Top 10 predicted token IDs for the next word: 
[1243, 670, 3404, 1517, 35664, 15910, 11, 8861, 1693, 2482]

Top 10 predicted next words for the sequence: 
[' things', ' work', ' stuff', ' thing', ' feats', ' tricks', ',', ' tasks', ' job', ' results']

End demo 
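
The list above shows the candidate words but not how strongly the model prefers each one. A minimal sketch of the same top-k selection in pure Python pairs each selected token with its softmax probability. It assumes a hypothetical five-token vocabulary with invented logit values, not real GPT-2 output, and the helper name `top_k_with_probs` is also made up for illustration. With the tensors from this cell, the scores would come from the `values` field that `torch.topk` returns alongside `indices`.

```python
import heapq
import math

# Hypothetical token -> logit table (values invented for illustration)
toy_logits = {" things": 3.0, " work": 2.0, " stuff": 1.5, ",": 0.5, " job": 0.0}

def top_k_with_probs(logits_by_token, k):
    """Return the k most likely tokens with their softmax probabilities."""
    peak = max(logits_by_token.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits_by_token.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # nlargest orders the candidates by probability, highest first
    return heapq.nlargest(k, probs.items(), key=lambda item: item[1])

top3 = top_k_with_probs(toy_logits, 3)
for token, prob in top3:
    print(f"{token!r}: {prob:.3f}")
```

The probabilities are normalized over the whole vocabulary before the top entries are selected, so the reported values do not sum to 1 across the top-k list.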


## Top 10 predicted next words in German

In [9]:

print("\nBegin next-word prediction using German GPT-2 model")

# Load the German GPT-2 tokenizer and model
toker = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# Example input sequence in German
seq = "ich habe hunger und koche jetzt"
print("\nInput sequence: ")
print(seq)

# Tokenize the text
inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

# Show the token IDs and the text each one decodes to
inpt_ids = inpts["input_ids"]
print("\nToken IDs and their words: ")
for token_id in inpt_ids[0]:
    word = toker.decode(token_id)
    print(token_id, word)

# Predict the next token
with torch.no_grad():
    logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits.shape)

# Select the 10 highest-scoring tokens (topk returns raw logits, not probabilities)
top_k = 10
top_k_result = torch.topk(logits, top_k)
top_k_ids = top_k_result.indices[0].tolist()

print(f"\nTop {top_k} predicted token IDs for the next word: ")
print(top_k_ids)

# Decode the top 10 tokens
print(f"\nTop {top_k} predicted next words for the sequence: ")
top_k_words = [toker.decode(pred_id) for pred_id in top_k_ids]
print(top_k_words)

print("\nEnd demo ")


Begin next-word prediction using German GPT-2 model

Input sequence: 
ich habe hunger und koche jetzt

Tokenized input data structure: 
{'input_ids': tensor([[ 277,  865,  315, 7442,  292,  339, 6738, 1333]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Token IDs and their words: 
tensor(277) ich
tensor(865)  habe
tensor(315)  h
tensor(7442) unger
tensor(292)  und
tensor(339)  k
tensor(6738) oche
tensor(1333)  jetzt

All logits for next word: 
torch.Size([1, 50265])

Top 10 predicted token IDs for the next word: 
[18, 412, 16, 5, 387, 472, 292, 633, 362, 941]

Top 10 predicted next words for the sequence: 
['.', ' nicht', ',', '!', ' für', ' auch', ' und', ' nur', ' zu', ' schon']

End demo 
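
Several of the top candidates above are punctuation marks rather than words. If only word-like continuations are wanted, one option is to filter the decoded candidates. The sketch below works on the decoded list printed above. The helper name `keep_word_tokens` is hypothetical, and the letters-only test is a simplifying assumption that would also drop tokens containing digits or hyphens.

```python
# Decoded top-10 candidates, copied from the output above
candidates = ['.', ' nicht', ',', '!', ' für', ' auch', ' und', ' nur', ' zu', ' schon']

def keep_word_tokens(tokens):
    """Keep tokens that consist only of letters once surrounding spaces are removed."""
    return [tok for tok in tokens if tok.strip().isalpha()]

word_candidates = keep_word_tokens(candidates)
print(word_candidates)
```

The filter runs after decoding, so the ranking among the remaining candidates is unchanged.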
