# Next Word Prediction using GPT-2

## Load a pre-trained large language model (LLM), GPT-2 (originally developed by OpenAI), fine-tune it to a specific text style, optimize it, and convert it to TensorFlow Lite

-  [Source](https://colab.research.google.com/github/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb#scrollTo=hkj4Tl_gL9by)

### Imports

In [1]:
! pip install keras_nlp
! pip install tensorflow
! pip install tensorflow_datasets 
! pip install tensorflow_text
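
Because `tensorflow_text` releases are tied to specific TensorFlow versions, it can help to check what was actually installed before importing anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to be the hyphenated PyPI names of the packages installed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed PyPI distribution names for the packages installed above.
packages = ["keras-nlp", "tensorflow", "tensorflow-datasets", "tensorflow-text"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

A missing package shows up as `not installed` instead of raising, so the check can run before the imports in the next cell.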


In [2]:
import numpy as np
import keras_nlp
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
from tensorflow import keras
from tensorflow.lite.python import interpreter
import time

2024-10-14 09:09:02.381001: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-14 09:09:02.410566: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-14 09:09:02.438952: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-14 09:09:02.455214: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-14 09:09:02.523179: I tensorflow/core/platform/cpu_feature_guar

### Generate some text

In [3]:

gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=256,
    add_end_token=True,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en", preprocessor=gpt2_preprocessor)

In [4]:
model_version = "gpt2_base_en"
# f-strings are required here; without the f prefix the braces stay literal
tokenizer_path = f"./models/{model_version}_tokenizer"
preprocessor_path = f"./models/{model_version}_preprocessor"
model_path = f"./models/{model_version}_model"

In [5]:
import os
from keras_nlp.models import GPT2Tokenizer, GPT2CausalLMPreprocessor, GPT2CausalLM

# KerasNLP classes use from_preset()/save_to_preset(). The Hugging Face-style
# from_pretrained()/save_pretrained() do not exist here; calling them fails with
# "AttributeError: type object 'GPT2Tokenizer' has no attribute 'from_pretrained'".
# save_to_preset() on preprocessors and task models assumes a recent KerasNLP release.

# Load the tokenizer from the local path, downloading and saving it if needed
def load_tokenizer():
    try:
        print("Trying to load tokenizer from local path...")
        return GPT2Tokenizer.from_preset(tokenizer_path)
    except Exception as e:
        print(f"Failed to load tokenizer: {e}. Downloading tokenizer...")
        tokenizer = GPT2Tokenizer.from_preset("gpt2_base_en")
        tokenizer.save_to_preset(tokenizer_path)  # Save for future use
        return tokenizer

# Load the preprocessor from the local path, downloading and saving it if needed
def load_preprocessor():
    try:
        print("Trying to load preprocessor from local path...")
        return GPT2CausalLMPreprocessor.from_preset(
            preprocessor_path,
            sequence_length=256,
            add_end_token=True,
        )
    except Exception as e:
        print(f"Failed to load preprocessor: {e}. Downloading preprocessor...")
        preprocessor = GPT2CausalLMPreprocessor.from_preset(
            "gpt2_base_en",
            sequence_length=256,
            add_end_token=True,
        )
        preprocessor.save_to_preset(preprocessor_path)  # Save for future use
        return preprocessor

# Load the model from the local path, downloading and saving it if needed
def load_model():
    try:
        print("Trying to load model from local path...")
        return GPT2CausalLM.from_preset(model_path)
    except Exception as e:
        print(f"Failed to load model: {e}. Downloading model...")
        model = GPT2CausalLM.from_preset("gpt2_base_en", preprocessor=load_preprocessor())
        model.save_to_preset(model_path)  # Save for future use
        return model

# Use the loader functions
gpt2_tokenizer = load_tokenizer()
gpt2_preprocessor = load_preprocessor()
gpt2_lm = load_model()

In [4]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=100)
print("\nGPT-2 output:")
print(output)  # print(output.numpy().decode("utf-8"))

end = time.time()
print("TOTAL TIME ELAPSED: ", end - start)


2024-10-08 10:49:08.166630: E tensorflow/core/util/util.cc:131] oneDNN supports DT_INT64 only on platforms with AVX-512. Falling back to the default Eigen-based implementation if present.
I0000 00:00:1728384555.423890   17265 service.cc:146] XLA service 0x7f175c02d750 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1728384555.425998   17265 service.cc:154]   StreamExecutor device (0): Host, Default Version
I0000 00:00:1728384555.761278   17268 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.



GPT-2 output:
My trip to Yosemite was a bit of a whirlwind, as I didn't have the time to do anything else. I was able to get some food, some drinks, and some time to relax and get some sleep.

I was lucky enough to have a nice view of the park and the Yosemite Valley. I had the opportunity to visit a few different sites and see some incredible sights. The views are amazing and the weather was good.

I was able to get a little more
TOTAL TIME ELAPSED:  18.59576392173767
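
The elapsed time above covers a single call, and the log shows an XLA cluster being compiled during that call, so the figure likely includes one-time compilation work. Timing several consecutive calls separates that startup cost from the steady-state cost. The sketch below uses a hypothetical helper `time_calls` and a stand-in workload so that it runs without the model; in the notebook the callable would be something like `lambda: gpt2_lm.generate("My trip to Yosemite was", max_length=100)`.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload (assumption): replace with the generate() call to time the model.
durations = time_calls(lambda: sum(range(100_000)), repeats=3)
print("first call:  ", durations[0])
print("later calls: ", durations[1:])
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.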


## Next word prediction

In [12]:
import keras_nlp
import numpy as np

# Example text
input_text = ["Today is a beautiful "]

# Compute the logits with predict()
prediction_logits = gpt2_lm.predict(input_text)

# Print the shape of the logits
print(f"Logits shape: {prediction_logits.shape}")

# Extract the logits at the last position of the sequence.
# Note: the preprocessor pads the input to sequence_length=256, so index -1
# is a padding position rather than the last token of the prompt.
last_token_logits = prediction_logits[0, -1, :]

# Top 10 most likely next tokens
top_k = 10
top_k_indices = np.argsort(last_token_logits)[-top_k:][::-1]  # Sort and take the top 10

# Use the tokenizer to decode the token IDs into words and strip the BPE markers
top_k_words = [gpt2_lm.preprocessor.tokenizer.id_to_token(token_id).replace('Ġ', '').replace('Ċ', '') for token_id in top_k_indices]

# Print the top 10 next words
print("\nTop 10 predicted next words:")
for i, word in enumerate(top_k_words):
    print(f"{i + 1}: {word}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 15s 15s/step
Logits shape: (1, 256, 50257)

Top 10 predicted next words:
1: 
2: The
3: I
4: 
5: This
6: A
7: We
8: (
9: Please
10: [
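
The logits above have shape `(1, 256, 50257)`, so the prompt was padded to 256 positions and index `-1` points at padding, which may explain why the predictions look unrelated to the prompt. The blank entries 1 and 4 are probably tokens that consisted only of the byte-level BPE markers `Ġ` (space) and `Ċ` (newline), which the cell deletes. The following is a minimal sketch of the alternative: pick the position of the last prompt token from a padding mask, turn its logits into probabilities, and keep the markers visible. It uses plain Python lists and toy values so it runs without the model; all helper names are hypothetical, and it assumes right-padding with an optional appended end token.

```python
import math


def last_prompt_index(padding_mask, has_end_token=False):
    """Index of the last prompt token in a right-padded sequence."""
    real = sum(1 for m in padding_mask if m)
    if has_end_token:
        real -= 1  # skip the appended end-of-text token
    if real < 1:
        raise ValueError("sequence contains no prompt tokens")
    return real - 1


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return the k most probable (token_id, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


def readable(token):
    """Show GPT-2 BPE markers instead of deleting them."""
    return token.replace("Ġ", "␣").replace("Ċ", "\\n")


# Toy data (assumption): 2 prompt tokens + 1 end token, padded to length 5,
# with a vocabulary of 3 tokens.
mask = [1, 1, 1, 0, 0]
toy_logits = [
    [0.1, 0.2, 0.3],
    [2.0, 0.5, 1.0],  # logits at the last prompt token
    [0.0, 0.0, 0.0],  # logits at the end token
    [0.0, 3.0, 0.0],  # padding
    [0.0, 0.0, 9.0],  # padding (what index -1 would select)
]
pos = last_prompt_index(mask, has_end_token=True)
best = top_k(toy_logits[pos], k=2)
print(pos, best)
print(readable("Ġday"), readable("Ċ"))
```

With the real model, the mask would come from the preprocessor output for the same prompt, and the selected row would replace `prediction_logits[0, -1, :]`.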


## Top 10 predictions for German

In [6]:
! pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.17.0

In [7]:
import tensorflow as tf
from transformers import TFAutoModelForCausalLM, AutoTokenizer
import numpy as np

# Load the German GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = TFAutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# Example text in German
input_text = "Heute ist ein schöner Tag"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="tf")

# Compute the logits with the GPT-2 model in TensorFlow
outputs = model(**inputs)
logits = outputs.logits

# Extract the logits at the last token position (no padding is added here)
last_token_logits = logits[:, -1, :]

# Top 10 most likely next tokens
top_k = 10
top_k_indices = tf.math.top_k(last_token_logits, k=top_k).indices.numpy()[0]

# Decode the token IDs into words
top_k_words = [tokenizer.decode([token_id]).strip() for token_id in top_k_indices]

# Print the top 10 next words
print("\nTop 10 predicted next words:")
for i, word in enumerate(top_k_words):
    print(f"{i + 1}: {word}")


config.json:   0%|          | 0.00/865 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.43M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/510M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.4.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.2.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All t


Top 10 predicted next words:
1: ,
2: .
3: für
4: zum
5: !
6: und
7: in
8: ",
9: ...
10: um
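
Half of the predictions above are punctuation, which is plausible because "Heute ist ein schöner Tag" already reads as a complete clause. If only word-like continuations are of interest, one option is to request more candidates than needed (for example a larger `k` in `tf.math.top_k`) and keep the first few that contain a letter. The sketch below shows only the filtering step in plain Python, using the ten tokens printed above as input; the helper name `word_candidates` is hypothetical.

```python
def word_candidates(ranked_tokens, k=10):
    """Keep the first k tokens, in ranked order, that contain at least one letter."""
    words = []
    for token in ranked_tokens:
        text = token.strip()
        if any(ch.isalpha() for ch in text):
            words.append(text)
        if len(words) == k:
            break
    return words


# The ten decoded tokens from the output above, in ranked order.
ranked = [",", ".", "für", "zum", "!", "und", "in", '",', "...", "um"]
print(word_candidates(ranked, k=3))  # ['für', 'zum', 'und']
```

`str.isalpha()` is Unicode-aware, so tokens with umlauts such as "für" count as words.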
