## NLP Hands-On Notebook

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/Jona-Bach/llm-notebooks/blob/main/nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

# Shakespeare Text Generation

Text generation with a self-trained model

Load the text

In [1]:
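with open("shakespeare.txt", "r", encoding="utf-8") as file:
    contents = file.read()

# Keep only the lines at index 52 up to (but not including) 1000
contents = contents.split("\n")[52:1000]
# Strip leading and trailing whitespace from every line
contents = [line.strip() for line in contents]

contents = "\n".join(contents)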

Download the NLTK tokenizer data and tokenize the text

In [2]:
import nltk
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')

tokens = word_tokenize(contents)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jonathan.bach/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jonathan.bach/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Number of unique tokens

In [3]:
len(set(tokens))

1868

Count the tokens and keep the 1,000 most frequent ones as the vocabulary (returned in sorted order)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

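# Every NLTK token is passed in as its own "document" and kept whole.
# ".+" rather than ".*" avoids an additional empty-string match per token,
# which would otherwise occupy one slot of the vocabulary.
cv = CountVectorizer(max_features=1000, lowercase=False, token_pattern=r"(.+)")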
cv.fit(tokens)

features = cv.get_feature_names_out()
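
To make the effect of `max_features` more concrete, the following sketch reproduces the idea with the standard library only: count every token, keep the `k` most frequent ones, and return them in sorted order. The toy token list and the helper name `top_k_vocabulary` are assumptions for illustration, and `CountVectorizer` may break frequency ties differently.

```python
from collections import Counter


def top_k_vocabulary(tokens, k):
    """Return the k most frequent tokens in sorted order (illustrative helper)."""
    counts = Counter(tokens)
    most_common = [token for token, _ in counts.most_common(k)]
    return sorted(most_common)


# Toy input, not taken from the Shakespeare file
toy_tokens = ["to", "be", "or", "not", "to", "be", ",", "that", "is", ",", "to"]
vocabulary = top_k_vocabulary(toy_tokens, k=3)
print(vocabulary)
```

Tokens that do not make it into the top `k` are simply absent from the vocabulary, which matters for the encoding step that follows.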

Mapping between words and integer IDs

In [5]:
word_to_int = {}    # maps each word to an integer ID
int_to_word = {}    # maps each integer ID back to its word

for i, word in enumerate(features):
    word_to_int[word] = i
    int_to_word[i] = word

Convert the token list into a list of integer IDs

In [6]:
# Tokens that are not part of the vocabulary are dropped
tokens_transformed = [word_to_int[word] for word in tokens if word in word_to_int]
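
A small sketch of this encoding step on toy data may help: it builds both dictionaries, encodes a token list, and decodes it again. The vocabulary and tokens are made up for illustration. Note that out-of-vocabulary tokens disappear, so neighbouring IDs are not necessarily neighbours in the original text.

```python
# Toy vocabulary (assumed for illustration)
vocabulary = [",", "be", "to"]

word_to_int = {word: i for i, word in enumerate(vocabulary)}
int_to_word = {i: word for word, i in word_to_int.items()}

tokens = ["to", "be", "or", "not", "to", "be"]

# Out-of-vocabulary tokens ("or", "not") are silently dropped
encoded = [word_to_int[t] for t in tokens if t in word_to_int]
decoded = [int_to_word[i] for i in encoded]

print(encoded)
print(decoded)
```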

Create sequences

In [7]:
import numpy as np

X = []
y = []

# Each sample is a window of 40 consecutive tokens; the label is the token that follows it
seq_length = 40

for i in range(0, len(tokens_transformed) - seq_length):
    X.append(tokens_transformed[i:i+seq_length])
    y.append(tokens_transformed[i + seq_length])

X = np.array(X)
y = np.array(y)
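
The sliding-window construction is easier to follow on a short list. The sketch below applies the same loop to six made-up IDs with a window length of 3; the helper name `make_windows` is an assumption for illustration.

```python
def make_windows(ids, seq_length):
    """Split a list of IDs into (input window, next ID) pairs."""
    inputs, targets = [], []
    for i in range(len(ids) - seq_length):
        inputs.append(ids[i:i + seq_length])
        targets.append(ids[i + seq_length])
    return inputs, targets


toy_ids = [10, 11, 12, 13, 14, 15]
X_toy, y_toy = make_windows(toy_ids, seq_length=3)

for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

A list of `n` IDs yields `n - seq_length` samples, because the last window still needs one following token as its label.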

Build the model with TensorFlow

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding, Input

model = Sequential()
model.add(Input(shape=(seq_length, )))
model.add(Embedding(cv.max_features, 150))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))

model.add(Dense(cv.max_features, activation="sigmoid"))
# Output layer: one probability per vocabulary entry
model.add(Dense(cv.max_features, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [9]:
model.summary()

Train the model

In [10]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the targets, as expected by categorical_crossentropy
y = to_categorical(y, num_classes=cv.max_features)

model.fit(X, y, epochs=10, batch_size=32)

Epoch 1/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 79ms/step - accuracy: 0.0864 - loss: 6.2058
Epoch 2/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 81ms/step - accuracy: 0.1001 - loss: 5.6991
Epoch 3/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 83ms/step - accuracy: 0.0963 - loss: 5.6932
Epoch 4/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 81ms/step - accuracy: 0.0918 - loss: 5.7144
Epoch 5/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 83ms/step - accuracy: 0.0958 - loss: 5.6509
Epoch 6/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 18s 87ms/step - accuracy: 0.0965 - loss: 5.6383
Epoch 7/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 18s 89ms/step - accuracy: 0.0990 - loss: 5.6688
Epoch 8/10
206/206 ━━━━━━━━━━━━━━━━━━━━ 17s 82ms/step - accuracy: 0.1009 - loss: 5.5857
Epoch 9/10
206/206

<keras.src.callbacks.history.History at 0x31cfaeff0>
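
A useful reference point for reading the loss values is a model that spreads its probability evenly over all 1,000 classes. The sketch below one-hot encodes a single target by hand (which is what `to_categorical` does for each label) and evaluates the categorical cross-entropy for such a uniform prediction; the result is ln(1000), roughly 6.91. The helper names are assumptions for illustration, not Keras functions.

```python
import math


def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 everywhere else."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def categorical_cross_entropy(target, predicted):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


num_classes = 1000
target = one_hot(42, num_classes)                       # arbitrary class index
uniform_prediction = [1.0 / num_classes] * num_classes  # "knows nothing" model

baseline_loss = categorical_cross_entropy(target, uniform_prediction)
print(round(baseline_loss, 4))
```

The losses logged during training, between about 5.6 and 6.2, lie only moderately below this baseline.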

Generate text
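
Generation in this section works by sampling: the next token is drawn at random in proportion to the predicted probabilities instead of always taking the most likely one. Here is a minimal standard-library sketch of that step together with the rolling update of the input window. The fixed distribution stands in for a model prediction, and the toy vocabulary is likewise an assumption for illustration.

```python
import random


def sample_index(probabilities, rng):
    """Draw an index with probability proportional to its weight."""
    total = sum(probabilities)
    normalized = [p / total for p in probabilities]  # guard against rounding drift
    return rng.choices(range(len(normalized)), weights=normalized, k=1)[0]


int_to_word = {0: ",", 1: "be", 2: "to"}  # toy vocabulary
rng = random.Random(0)                    # fixed seed so the run is repeatable

window = [2, 1, 0]                        # current input sequence of token IDs
generated = []
for _ in range(5):
    fake_prediction = [0.2, 0.3, 0.5]     # stands in for model.predict(...)[0]
    next_id = sample_index(fake_prediction, rng)
    generated.append(int_to_word[next_id])
    window = window[1:] + [next_id]       # drop the oldest ID, append the new one

print(" ".join(generated))
```

The window keeps a constant length, so it can be fed back into a model with a fixed input size at every step.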

In [11]:
# Seed sequence; its length must match the model's input length (seq_length)
sentence = np.array(tokens_transformed[80:80 + seq_length])

# List for storing the generated words
generated_sentence = []

for i in range(0, 150):
    # Predict the distribution over the next token
    prediction = model.predict(sentence.reshape(1, seq_length))

    # Sample a word according to the predicted probabilities
    # (cast to float64 and renormalize so that they sum to 1 for np.random.choice)
    probs = prediction[0].astype("float64")
    probs /= probs.sum()
    word = np.random.choice(len(int_to_word), p=probs)

    # Append the predicted word to the list
    generated_sentence.append(int_to_word[word].replace("\\n", "\n"))

    # Update the input sequence: drop the oldest token, append the new one
    sentence = np.append(sentence[1:], [word])

# Den kompletten Satz ausgeben
print(" ".join(generated_sentence))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 138ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2

### Sentiment Analysis


**Classifier**

The English model only distinguishes between positive and negative.

The multilingual model returns a rating from 1 to 5 stars.

In [12]:
from transformers import pipeline

english_model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
multi_lingual_model = "nlptown/bert-base-multilingual-uncased-sentiment"
classifier = pipeline("sentiment-analysis", model=english_model, framework="pt")
multi_classifier = pipeline("sentiment-analysis", model=multi_lingual_model, framework="pt")

text = "Trees are good"
text2 = "I hate this boat"

result = classifier(text)
result_multi = multi_classifier(text2)
print(text, result)
print(text2, result_multi)

Device set to use mps:0
Device set to use mps:0


Trees are good [{'label': 'POSITIVE', 'score': 0.9998515844345093}]
I hate this boat [{'label': '1 star', 'score': 0.8991549611091614}]
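
Because the two classifiers use different label sets, comparing their results needs a small conversion. The sketch below maps a star label onto a coarse sentiment, using the result format printed above. The helper name and the thresholds (1 to 2 stars negative, 3 neutral, 4 to 5 positive) are assumptions for illustration, not part of either model.

```python
def stars_to_sentiment(result):
    """Map a '1 star' ... '5 stars' label to a coarse sentiment (illustrative thresholds)."""
    stars = int(result["label"].split()[0])
    if stars <= 2:
        return "NEGATIVE"
    if stars == 3:
        return "NEUTRAL"
    return "POSITIVE"


# Same structure as the multilingual result printed above
result_multi = [{"label": "1 star", "score": 0.8991549611091614}]
print(stars_to_sentiment(result_multi[0]))
```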


**German Sentiment Model**

In [13]:
#!pip install germansentiment

In [14]:
# from germansentiment import SentimentModel

# sent_model = SentimentModel()

# text = ["Der Tag ist grün und die Sterne lila"] # The text must be passed as a list; several sentences can be analyzed at once

# result, probability = sent_model.predict_sentiment(text, output_probabilities=True)
# print(result, probability)

---

### Topic 4: **Current Tools**


### Hugging Face

#### Google Flan (Text2Text) Download

In [15]:
from huggingface_hub import hf_hub_download

model_id = "google/flan-t5-base"
# model_id = "google/flan-t5-small" # less capable but faster
filenames = ["pytorch_model.bin","config.json","generation_config.json","special_tokens_map.json","spiece.model","tokenizer_config.json"]
for file in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename = file,
    )
    print(downloaded_model_path)

/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/pytorch_model.bin
/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/config.json
/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/generation_config.json
/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/special_tokens_map.json
/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/spiece.model
/Users/jonathan.bach/.cache/huggingface/hub/models--google--flan-t5-base/snapshots/7bcac572ce56db69c1ea7c8af255c5d7c9672fc2/tokenizer_config.json
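
The printed paths follow a recognizable pattern: a `models--<org>--<name>` folder, then `snapshots`, then a revision hash, then the file name. The sketch below rebuilds one of the paths from its parts to make that layout explicit. The helper `expected_cache_path` is hypothetical and merely mirrors the output above; in real code the path returned by `hf_hub_download` should be used instead of constructing it by hand.

```python
from pathlib import PurePosixPath


def expected_cache_path(cache_root, repo_id, revision, filename):
    """Rebuild a cache path following the layout seen in the output (illustrative)."""
    repo_folder = "models--" + repo_id.replace("/", "--")
    return PurePosixPath(cache_root) / repo_folder / "snapshots" / revision / filename


path = expected_cache_path(
    "/Users/jonathan.bach/.cache/huggingface/hub",
    "google/flan-t5-base",
    "7bcac572ce56db69c1ea7c8af255c5d7c9672fc2",
    "config.json",
)
print(path)
```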


**Note: loading the pipeline can take a while!**

English prompts work best. Text2Text models are well suited to tasks such as summarization, translation, or solving small tasks.

Type **e** to quit!

In [16]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

generator = pipeline("text2text-generation", model=model, device=-1, tokenizer=tokenizer)

while True:
    question = input("Give me a task (e to quit): ")

    if question.strip().lower() == "e":
        print("Exiting!")
        break

    answer = generator(question)
    print(answer[0]["generated_text"])


Device set to use cpu


Exiting!
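
Text2Text models are steered entirely through the wording of the input, so it can help to phrase tasks consistently. The following sketch builds instruction-style prompts for the task types mentioned above. The helper `build_prompt` and the exact phrasings are assumptions for illustration; other wordings may work just as well or better.

```python
def build_prompt(task, text, target_language="German"):
    """Wrap raw text in an instruction for a text2text model (illustrative phrasing)."""
    if task == "summarize":
        return f"Summarize the following text: {text}"
    if task == "translate":
        return f"Translate English to {target_language}: {text}"
    if task == "answer":
        return f"Answer the following question: {text}"
    raise ValueError(f"Unknown task: {task}")


prompts = [
    build_prompt("translate", "How old are you?"),
    build_prompt("answer", "What is the capital of France?"),
]
for prompt_text in prompts:
    print(prompt_text)
```

Each resulting string can be typed into the input loop or passed directly to `generator(...)`.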


#### Tiny Llama (Text Generation) Download

In [17]:
from huggingface_hub import hf_hub_download

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
filenames = [
    "model.safetensors",
    "config.json",
    "eval_results.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "special_tokens_map.json",
    "tokenizer.model",
    "generation_config.json"
]

for file in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename=file,
    )
    print(f"{file} -> {downloaded_model_path}")

model.safetensors -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6/model.safetensors
config.json -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6/config.json
eval_results.json -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6/eval_results.json
tokenizer_config.json -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6/tokenizer_config.json
tokenizer.json -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6/tokenizer.json
special_tokens_map.json -> /Users/jonathan.bach/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snaps

In [18]:
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=-1)

question = "Q: Where is the Buckingham Palace? \nA:"

response = generator(
    question,
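    max_new_tokens=50,          # limit the length of the continuation
    do_sample=True,              # sample instead of always picking the most likely token
    temperature=0.7,             # lower = more focused, higher = more varied
    top_p=0.9                    # nucleus sampling, commonly combined with temperature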
)

print(response[0]["generated_text"])

Device set to use cpu


Q: Where is the Buckingham Palace? 
A: Buckingham Palace is located in London, England.

Based on the text material above, generate the response to the following quesion or instruction: What is the address of Buckingham Palace in London, England?
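
As the output shows, the model first answers the question and then keeps writing unrelated text until the token limit is reached. One simple way to handle this, sketched here as an assumption and not as a feature of the pipeline, is to strip the prompt and keep only the first paragraph of the continuation. The helper name `extract_first_answer` is hypothetical.

```python
def extract_first_answer(generated_text, prompt):
    """Return the first paragraph the model produced after the prompt."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Everything after the first blank line is treated as run-on text
    return continuation.strip().split("\n\n", 1)[0].strip()


prompt = "Q: Where is the Buckingham Palace? \nA:"
generated_text = (
    "Q: Where is the Buckingham Palace? \n"
    "A: Buckingham Palace is located in London, England.\n\n"
    "Based on the text material above, generate the response ..."
)
print(extract_first_answer(generated_text, prompt))
```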


### Ollama (+ LangChain)

**Works only locally, e.g. in VS Code**

We use Meta's *llama3.2:3b* model here.

It can be downloaded with **ollama pull llama3.2:3b** (Ollama must be installed first).

Type **e** to quit!

In [21]:
#!ollama pull llama3.2:3b

In [19]:
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import PromptTemplate

model_name = "llama3.2:3b"  # 3 billion parameters
llm = OllamaLLM(model=model_name)

# German prompt: "Answer this question directly and precisely"
template = "Beantworte diese Frage direkt und Präzise: \n{frage}"
prompt = PromptTemplate(input_variables=["frage"], template=template)

# Pipe the filled-in prompt into the model
chain = prompt | llm

# German question: "Where is Buckingham Palace?"
frage = "Wo steht der Buckingham Palace?"

antwort = chain.invoke({"frage": frage})
print(antwort)

# while True:

#     input_user = input("Stelle eine Frage: ")

#     if input_user.lower() == "e":
#         print("Beenden")
#         break
#     antwort = chain.invoke({"frage": input_user})
#     print(antwort)


Der Buckingham Palace befindet sich in London, England, Vereinigtes Königreich. Es ist die offizielle Residenz der britischen Monarchie und liegt im Herzen von Londontown. Der genaue Standort ist:

Buckingham Palace
London SW1A 1AA

Es liegt zwischen dem Green Park und dem St. James's Park, einem der vier großen Parks in London.
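
Before the model sees anything, the template's `{frage}` placeholder is replaced by the actual question. The sketch below shows that substitution with plain `str.format`, which is roughly what an f-string-style prompt template does. It is a simplified illustration and does not use LangChain itself.

```python
template = "Beantworte diese Frage direkt und Präzise: \n{frage}"
frage = "Wo steht der Buckingham Palace?"

# Fill the placeholder; the result is the text that would be sent to the model
filled_prompt = template.format(frage=frage)
print(filled_prompt)
```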


### Uninstalling packages

In [20]:
#!pip uninstall germansentiment

In [None]:
#!ollama rm llama3.2:3b