# Text generation

Set the notebook runtime type to GPU.

In [1]:
!nvidia-smi

Mon Jan 20 11:48:48 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Hungarian-language LLM generators

https://juniper.nytud.hu/demo/puli

https://huggingface.co/NYTK

## Text generation

As an example, take the pre-trained GPT-2 model available from Hugging Face and load its tokenizer as well.

A GPU runtime is recommended for the cells below.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

tokenizer.pad_token = tokenizer.eos_token

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using device: cuda


How the tokenizer works:

In [3]:
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

print(input_ids, "\n")

for token_id in input_ids[0]:
  print(token_id, tokenizer.decode(token_id, skip_special_tokens=True))

tensor([[7454, 2402,  257,  640]], device='cuda:0') 

tensor(7454, device='cuda:0') Once
tensor(2402, device='cuda:0')  upon
tensor(257, device='cuda:0')  a
tensor(640, device='cuda:0')  time


The generation process, and a single step of it:

In [4]:
output = model.generate(input_ids,
                        max_length=50,
                        num_return_sequences=1,
                        pad_token_id=tokenizer.pad_token_id)


generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a


In [5]:
input_text = "The spiderman was"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

import torch.nn.functional as F

with torch.no_grad():
    outputs = model(input_ids)

# logits at the last position, i.e. the scores for the next token
last_token_logits = outputs.logits[0, -1, :]

# apply softmax to turn the logits into probabilities
probs = F.softmax(last_token_logits, dim=-1)

# the 5 tokens with the highest probability
top_k_probs, top_k_indices = torch.topk(probs, 5)

top_k_tokens = [tokenizer.decode([token]) for token in top_k_indices]

for i, (token, prob) in enumerate(zip(top_k_tokens, top_k_probs)):
    print(f"Top {i+1} token: '{token}', probability:", round(prob.item(), 3), "-->", input_text + f"{token}")

Top 1 token: ' a', probability: 0.072 --> The spiderman was a
Top 2 token: ' able', probability: 0.024 --> The spiderman was able
Top 3 token: ' the', probability: 0.024 --> The spiderman was the
Top 4 token: ' also', probability: 0.023 --> The spiderman was also
Top 5 token: ' not', probability: 0.018 --> The spiderman was not
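
The single generation step above (logits, then softmax, then top-k) can also be traced without a model. The following is a minimal sketch in plain Python; the four-token vocabulary and the logit values are made up for illustration, and the helper names `softmax` and `top_k` are hypothetical, not part of `torch` or `transformers`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k most probable entries."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy logits for a hypothetical 4-token vocabulary (not real GPT-2 values).
toy_vocab = [" a", " the", " not", " able"]
toy_logits = [2.0, 1.0, 0.5, 1.0]

toy_probs = softmax(toy_logits)
for index, prob in top_k(toy_probs, 2):
    print(repr(toy_vocab[index]), round(prob, 3))
```

The real model does the same thing over a vocabulary of roughly fifty thousand tokens, which is why even the most likely continuation above only reaches a probability of 0.072.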


In [6]:
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

output = model.generate(input_ids,
                        max_length=50,
                        num_return_sequences=1,
                        pad_token_id=tokenizer.pad_token_id,
                        do_sample=True,
                        top_k=0,
                        temperature=1.0)


generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Once upon a time I thought I would leap into General Savage's tent, all the way over. Yes, I was joking, don't you? It was no surprise, I thought. There was a tremendous thing there with soldiers who did away with
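
The sampled continuation differs from the greedy one because each next token is drawn from the probability distribution instead of always taking the most likely token. In the cell above, `top_k=0` is commonly understood to switch off top-k filtering, so the draw runs over the whole vocabulary. The effect of `temperature` can be sketched on toy numbers; the logits below are illustrative and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # illustrative values, not model output
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top token, while a temperature above 1 flattens the distribution and makes unlikely tokens more probable.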


## Generative fine-tuning

We want to fine-tune the model above so that it writes "bad online reviews". To do this, we use the 1- and 2-star reviews from a dataset of online text reviews.

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
import pandas as pd
from datasets import Dataset

df = pd.read_parquet("hf://datasets/Yelp/yelp_review_full/yelp_review_full/train-00000-of-00001.parquet")
df

Unnamed: 0,label,text
0,4,dr. goldberg offers everything i look for in a...
1,1,"Unfortunately, the frustration of being Dr. Go..."
2,3,Been going to Dr. Goldberg for over 10 years. ...
3,3,Got a letter in the mail last week that said D...
4,0,I don't know what Dr. Goldberg was like before...
...,...,...
649995,4,I had a sprinkler that was gushing... pipe bro...
649996,0,Phone calls always go to voicemail and message...
649997,0,Looks like all of the good reviews have gone t...
649998,4,I was able to once again rely on Yelp to provi...


In [None]:
df = df[df.label < 2]
df = df[["text"]].iloc[:10_000]

dataset = Dataset.from_pandas(df)
dataset = dataset.remove_columns("__index_level_0__")

def tokenize_function(examples):
    encoding = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    encoding['labels'] = encoding['input_ids'].copy()
    return encoding

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
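
In `tokenize_function` the labels are a plain copy of `input_ids`, so the padded positions also contribute to the training loss (the padding token here is GPT-2's EOS token). A common variation, which this notebook does not use, is to replace the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists; `mask_padding_labels` and `IGNORE_INDEX` are hypothetical names introduced for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: 50256 is GPT-2's EOS token id, reused as the padding token above.
toy_input_ids = [[7454, 2402, 50256, 50256]]
toy_attention_mask = [[1, 1, 0, 0]]
print(mask_padding_labels(toy_input_ids, toy_attention_mask))
```

Under this assumption, `tokenize_function` would assign the result of `mask_padding_labels(encoding['input_ids'], encoding['attention_mask'])` to `encoding['labels']`. The logged loss values would then differ from the ones shown in this notebook.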

The next step, the fine-tuning itself, takes about 5 minutes on a GPU.

In [None]:
import os
from transformers import Trainer, TrainingArguments

os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    logging_dir='./logs',
    logging_steps=50,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

trainer.train()

trainer.save_model('./results')

Step,Training Loss
50,3.0641
100,2.9133
150,2.7995
200,2.875
250,2.8469
300,2.8859
350,2.8143
400,2.8971
450,2.861
500,2.8414
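
One way to read these numbers is to convert them to perplexity. The sketch below assumes that the logged value is the mean per-token cross-entropy in nats, which is the usual convention for causal language models; in that case the perplexity is simply the exponential of the loss. Because the labels in this notebook include padded positions, the values describe this particular setup and are not comparable to published GPT-2 perplexities.

```python
import math

# Training losses copied from the log above (step -> loss).
logged_losses = {50: 3.0641, 250: 2.8469, 500: 2.8414}

for step, loss in logged_losses.items():
    perplexity = math.exp(loss)  # perplexity = e^(mean cross-entropy in nats)
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity:.1f}")
```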


Let's see what the original and the fine-tuned model generate when both are prompted with the words "The restaurant".

In [None]:
input_text = "The restaurant"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

model = AutoModelForCausalLM.from_pretrained('gpt2')
model = model.to(device)

pre_trained_output = model.generate(input_ids, max_length=25, pad_token_id=tokenizer.pad_token_id)
pre_trained_text = tokenizer.decode(pre_trained_output[0], skip_special_tokens=True)
print("Pre-trained model output:")
print(pre_trained_text)



fine_tuned_model = AutoModelForCausalLM.from_pretrained('./results').to(device)

fine_tuned_output = fine_tuned_model.generate(input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id, do_sample=True, temperature=0.7 )
fine_tuned_text = tokenizer.decode(fine_tuned_output[0], skip_special_tokens=True)
print("\nFine-tuned model output:")
print(fine_tuned_text)

Pre-trained model output:
The restaurant is located in the heart of the city's downtown, and is open daily from 9 a.m. to 5

Fine-tuned model output:
The restaurant is awesome but the food is horrible. I don't know how to get a good bite at this place. The food is good. I would not recommend it to anyone.


# In-context learning

In-context "few-shot learning" means that when a few examples are included in a text prompt, the model can generalize to new, previously unseen examples.

## Sentiment classification task

In [None]:
import pandas as pd

df = pd.read_parquet("hf://datasets/Yelp/yelp_review_full/yelp_review_full/train-00000-of-00001.parquet")

neg = df[df.label == 0][:1000]
pos = df[df.label == 4][:1000]
df = pd.concat([neg, pos])
df

Unnamed: 0,label,text
4,0,I don't know what Dr. Goldberg was like before...
7,0,I'm writing this review to give you a heads up...
10,0,Owning a driving range inside the city limits ...
11,0,This place is absolute garbage... Half of the...
24,0,"Used to go there for tires, brakes, etc. Thei..."
...,...,...
5727,4,"Amazing habanero shrimp appetizer, awesome Sou..."
5738,4,David Wichnoski is the best. I woke up a few ...
5739,4,I recently visited this eye doctor after movin...
5740,4,Much like the others that have reivewed this l...


In [None]:
train_data = df.sample(frac = 0.9)
test_data = df.drop(train_data.index)
train_data.reset_index(drop=True,inplace=True)
test_data.reset_index(drop=True,inplace=True)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")  # always predicts the most frequent class of the training set
dummy_clf.fit(train_data.text, train_data.label)
baseline_prediction = dummy_clf.predict(test_data)  # predictions on the evaluation set
accuracy_score(test_data.label, baseline_prediction)  # argument order is (y_true, y_pred)

0.5
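
The "most frequent" strategy is simple enough to spell out by hand. The following sketch reproduces the idea on toy label lists; `majority_baseline_accuracy` is a hypothetical helper and the labels are invented, using the same 0 (negative) and 4 (positive) coding as the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

toy_train = [0, 0, 4, 4, 4]  # label 4 is the majority class here
toy_test = [0, 4, 4, 0]
print(majority_baseline_accuracy(toy_train, toy_test))
```

On a balanced evaluation set such a classifier cannot do better than 0.5, which is the bar the prompting approaches below have to clear.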

## Zero-shot prediction

Instead of the GPT-2 model used so far, we now use the 1-billion-parameter variant of Llama 3.2. The advantage of this model is that it copes with longer texts; on the other hand, fine-tuning it would have taken much longer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct")

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

model = model.to(device)

Using device: cuda


To build the prompt, it helps to convert the numeric labels into text:


In [None]:
def get_answer(label):
    if label==0:
        return 'It is negative. '
    else:
        return 'It is positive. '

The inverse of the previous operation: it maps the textual label back to a numeric one (0, 4, or -1 if neither word is found), which is needed for the accuracy calculation with `accuracy_score`.

In [None]:
def get_prediction(text):
    if 'negative' in text:
        return 0
    elif 'positive' in text:
        return 4
    else:
        return -1

In [None]:
import random

def get_prompt(train_data, val_sentence):
  prompt = val_sentence['text'] + '\n Is this review positive or negative? I think it is '
  return prompt, len(prompt)

During prompting we iterate over the `test_data` evaluation set and build a prompt for each item (in the few-shot setting the prompt also contains a handful of labelled examples).
1. Build the prompt.
2. Use the tokenizer to obtain the `input_ids`, the model's inputs,
3. which are passed to the `model.generate` method:
  - **`do_sample`**: whether to use sampling during text generation
  - **`max_length`**: the maximum length of the full sequence in tokens (prompt plus generated tokens), not in words
  - **`temperature`**: controls the amount of randomness in text generation
  - **`top_k`**: controls how many of the most probable tokens are considered at each generation step
  - **`top_p`**: if set below 1, only the smallest set of most probable tokens whose cumulative probability reaches `top_p` is kept at each generation step
([see more](https://huggingface.co/blog/how-to-generate) )

4. The `generated_ids` in the output have to be decoded to recover the generated text.
5. Extract the result and store the predicted and true labels.
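
The `top_k` and `top_p` filters described in step 3 can be sketched on a toy distribution. This is a simplified illustration in plain Python, not the `transformers` implementation; `top_k_filter`, `top_p_filter` and the probability values are assumptions made for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token probabilities
print(top_k_filter(toy_probs, 2))    # keeps token indices 0 and 1
print(top_p_filter(toy_probs, 0.1))  # keeps only token index 0
print(top_p_filter(toy_probs, 0.9))  # keeps token indices 0, 1 and 2
```

With a value as low as `top_p=0.1`, a single token is often enough to reach the threshold, so sampling behaves almost like greedy decoding even though `do_sample=True`.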

In [None]:
def eval(train_data, test_data):
  predictions = []
  labels = []

  for i in range(len(test_data)):
    prompt, prompt_length = get_prompt(train_data, test_data.iloc[i])
    input_ids = tokenizer(prompt, return_tensors='pt', max_length=1000).input_ids.to(model.device)
    ml = int(input_ids.size()[1])
    with torch.no_grad():
      generated_ids = model.generate(input_ids, do_sample=True, max_length=ml + 2, temperature=0.3, top_k=10, top_p=0.1, pad_token_id=tokenizer.pad_token_id)
    print('-----------------------------------------------------------------------------------------------------')
    generated_text = tokenizer.decode(generated_ids[0])
    print(generated_text)
    predictions.append(get_prediction(generated_text[-10:]))
    labels.append(test_data.iloc[i]['label'])
    print('predicted: '+str(predictions[i])+' label: '+str(labels[i]))
  return predictions, labels
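
The function above reads the label from the last 10 characters of the decoded text, which works for ` negative.` and ` positive.` but is sensitive to capitalisation and trailing characters. A possible alternative, sketched here under assumptions, is to decode only the newly generated tokens (for example `generated_ids[0][input_ids.shape[1]:]`) and parse that continuation. The helper `parse_label` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_label(continuation):
    """Map a generated continuation to 0 (negative), 4 (positive) or -1 (unknown)."""
    text = continuation.lower()
    neg = text.find("negative")
    pos = text.find("positive")
    if neg == -1 and pos == -1:
        return -1  # neither word was generated
    if pos == -1 or (neg != -1 and neg < pos):
        return 0   # "negative" is the only match or appears first
    return 4

print(parse_label(" negative."))  # 0
print(parse_label(" Positive!"))  # 4
print(parse_label("100%"))        # -1
```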

In [None]:
#tokenizer = AutoTokenizer.from_pretrained('gpt2', truncation_side="left")



In [None]:
predictions, labels = eval([], test_data)  # eval returns (predictions, labels)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


-----------------------------------------------------------------------------------------------------
<|begin_of_text|>The worst dental office I ever been. No one can beat it!!! You should avoid it at any time.
 Is this review positive or negative? I think it is 100%
predicted: -1 label: 0
-----------------------------------------------------------------------------------------------------
<|begin_of_text|>I love Steak N Shake. This one, however, leaves a lot to be desired. The food often comes out cold and the servers are apathetic at best. Most of the time, they're downright rude. I highly recommend avoiding this location if possible. There's one on Route 51 that's worth the 15 minutes it'll take you to get there.
 Is this review positive or negative? I think it is  negative.
predicted: 0 label: 0
-----------------------------------------------------------------------------------------------------
<|begin_of_text|>Every time we come here the service is laughably bad. On this visit a 

In [None]:
accuracy = accuracy_score(labels, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.77


## Few-shot learning

Prompt construction. The function takes negative and positive examples from the training set in equal proportion, then appends the next item from the evaluation set without its label.
Example:



```
Few-shot examples:
This is a great movie with outstanding acting.
Is this sentence positive or negative?
Answer: positive

This is bad, a complete waste of time.
Is this sentence positive or negative?
Answer: negative

Prompt:
I really enjoyed this movie, the set was spectacular.
Is this sentence positive or negative?
Answer:

```




In [None]:
import random

def get_prompt(train_data, val_sentence):
  limit = 2
  labels = [0, 4]
  few_shot_data = []


  for label in labels:
    c=0
    while c<limit:
      sample = train_data.sample(1).iloc[0]
      if sample["label"] == label:
        few_shot_data.append(sample)
        c+=1
  random.shuffle(few_shot_data)
  fin_few_shot_string = ''

  for j in range(0,len(few_shot_data)):
    fin_few_shot_string += few_shot_data[j]['text'] + ' Is this review positive or negative? '
    fin_few_shot_string += get_answer(few_shot_data[j]['label'])
    fin_few_shot_string += "\n"
  fin_few_shot_string += val_sentence['text'] + ' Is this review positive or negative?  I think it is '

  return fin_few_shot_string, len(fin_few_shot_string)
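
`get_prompt` draws one random row at a time and keeps it only when the label matches, so the same review can be picked twice. A possible variation is to filter by label first and then sample without replacement. The sketch below works on plain `(text, label)` tuples instead of a DataFrame; `build_few_shot_prompt`, `per_class` and `seed` are hypothetical names. Ending the prompt with `It is`, so that it mirrors the example answers, is an assumption that differs from the `I think it is ` ending used in the notebook.

```python
import random

def build_few_shot_prompt(examples, query_text, per_class=2, seed=None):
    """Build a balanced few-shot prompt from (text, label) pairs with labels 0 or 4."""
    rng = random.Random(seed)
    shots = []
    for label in (0, 4):
        pool = [ex for ex in examples if ex[1] == label]
        shots.extend(rng.sample(pool, per_class))  # ValueError if the pool is too small
    rng.shuffle(shots)  # avoid a fixed negative-then-positive order
    question = " Is this review positive or negative? "
    lines = [
        text + question + ("It is negative." if label == 0 else "It is positive.")
        for text, label in shots
    ]
    lines.append(query_text + question + "It is")  # unlabeled query comes last
    return "\n".join(lines)

toy_examples = [
    ("Cold food and rude staff.", 0), ("Never again.", 0), ("Dirty tables.", 0),
    ("Lovely service.", 4), ("Best coffee in town.", 4), ("Great value.", 4),
]
print(build_few_shot_prompt(toy_examples, "The pasta was superb.", seed=0))
```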

In [None]:
prompt, prompt_length = get_prompt(train_data, test_data.iloc[0,:])
print(prompt)

The vanilla lattes are a little sweet, but delicious. This place is perfect when the weather is nice and they can open the garage door. It's definitely a must-visit coffee shop on Walnut. Is this review positive or negative? It is positive. 
The italian coldcut sub was a travesty! Lots of STALE bread with skimpy meat portions, and no hots or flavor. Bread better suited for pigeons than people. I can find better Italian sandwiches back home in Baltimore. We'll feed these to our dogs when we get home.  You get what you pay for. Avoid this tourist trap! Is this review positive or negative? It is negative. 
Can't emphasize enough how great this place is. It makes me sad that everyone goes to Pamela's when La Feria is right above it - I always go here for breakfast/brunch and am always extremely satisfied.\n\nIt's a great environment: La Feria, in addition to a restaurant, also acts as a Peruvian shop. They sell clothes, toys, and even amazing chess sets. The seating is cozy, the people are

In [None]:
input_ids = tokenizer(prompt, return_tensors='pt', max_length=1000).input_ids.to(model.device)
ml = int(input_ids.size()[1])
with torch.no_grad():
  generated_ids = model.generate(input_ids, do_sample=True, max_length=ml + 2, temperature=0.3, top_k=10, top_p=0.1)
generated_text = tokenizer.decode(generated_ids[0])
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|>Great little place. Treats you like a local.Eaten here 3 times a week for a month.  Same overtime. Barb is always here. Is this review positive or negative? It is positive. 
This was technically the Clarion hotel so I think the name is wrong on the page. \n\nI'd like to start by saying that I'm not much of a complainer, but if could have given 0 stars, I would have. \n\nSo we are on our way up to room when we passed a vending machine with water underneath it. I assumed it was leaking and brushed it off. I don't care that much about little things like that especially since we were just sleeping there. After a great hockey game and a bar crawl, we came back to the hotel and got ready for bed. There was absolutely no hot water for my shower. So cold I wouldn't even consider hopping in really quick. It was late so I let it go and decided to try again in the morning. When I woke up it was the same thing. At this point I was fuming. I NEED a shower man!\n\nSo I put on my ang

In [None]:
predictions, labels = eval(train_data, test_data)  # eval returns (predictions, labels)

-----------------------------------------------------------------------------------------------------
<|begin_of_text|>Either my English muffin sandwich sat under a hot lamp all morning OR Burger King's kitchen staff just decided to deep fry the whole damn thing.  If that's not enough, the sausage smelled of animal decay.\n\nJust awful. Is this review positive or negative? It is negative. 
The worst chinese food I've had. I got the General Tso's Chicken and it came dry and very not spicy at all when I asked for extra spicy. Panda Express is better. Is this review positive or negative? It is negative. 
I love this place.  I am getting married here in July.  I've been a hotel guest a few times and the service is just fantastic.  The staff knows exactly what to do to make it an enjoyable stay.  The breakfast on the top floor was awesome.  Can't wait until July!  Maybe I should write an update in a few months? Is this review positive or negative? It is positive. 
This place was DELICIOUS!!

In [None]:
accuracy = accuracy_score(labels, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.35


Question: How could the result be improved? Can it be improved at all?

# Practice exercises



*   Can better prompting improve the zero-shot results?
*   Fine-tune GPT-2 on the sentiment classification task. What results does it achieve?

