<a href="https://colab.research.google.com/github/Cheetah-lhp/Privacy-Backdoors/blob/main/Decoding_methods_for_language_generation_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
!pip install -q transformers

In [24]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

In [25]:

# Greedy search
# Encode the context (prompt) into input tensors
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)
# max_new_tokens caps the number of tokens generated after the prompt
greedy_output = model.generate(**model_inputs, max_new_tokens=40)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure
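
The repetition in the greedy output above follows from always picking the single most likely next token. A minimal sketch of that rule, assuming a hypothetical hand-written next-token table (`TOY_NEXT`) in place of GPT-2:

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
# These numbers are made up for illustration; they are not GPT-2 outputs.
TOY_NEXT = {
    "I": {"am": 0.6, "walk": 0.4},
    "am": {"not": 0.7, "happy": 0.3},
    "not": {"sure": 0.9, "happy": 0.1},
    "sure": {"I": 0.8, "<eos>": 0.2},
}

def greedy_decode(start, max_new_tokens):
    """Append the highest-probability next token at every step."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = TOY_NEXT.get(tokens[-1])
        if probs is None:  # no known continuation for this token
            break
        next_token = max(probs, key=probs.get)  # argmax over next-token probabilities
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

greedy_tokens = greedy_decode("I", max_new_tokens=8)
print(" ".join(greedy_tokens))
```

Because the choice at each step depends only on the highest local probability, the toy decoder falls into the same kind of loop as the model output above.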


In [26]:
# Beam search
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,  # number of beams (candidate branches) tracked when computing the overall sequence probability
    early_stopping=True,  # (optional) generation finishes when all beam hypotheses have reached the EOS token
    no_repeat_ngram_size=2,  # (optional) n-gram penalty: no 2-gram may appear twice; the probability of any next word that would create an already-seen n-gram is set to 0
    num_return_sequences=5,  # (optional) number of highest-scoring beams to return (num_return_sequences <= num_beams)
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea
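
The n-gram penalty used above can be sketched in a few lines. This is an illustrative reimplementation under simple assumptions (tokens as strings, a hypothetical helper named `banned_next_tokens`), not the code `transformers` runs internally:

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size <= 0:
        return set()
    # The last (n - 1) generated tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])  # generating this token would repeat the n-gram
    return banned

generated = ["I", "walk", "my", "dog", "and", "I"]
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)
```

In a real decoder the probability of every banned token would be set to 0 before the next step; here the 2-gram "I walk" has already occurred, so "walk" is banned after the final "I".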

In [27]:
# Sampling
from transformers import set_seed
set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,  # activate sampling
    top_k=0,  # deactivate top-k filtering by setting top_k to 0
    temperature=0.6,  # lower the softmax temperature to sharpen the distribution (raises the likelihood of high-probability words and lowers the likelihood of low-probability words)
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, and I was delighted to have him on my show, so I had a chance to see him. I was very impressed with his body, and I am looking forward to seeing what he has to
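
The effect of `temperature` can be checked on a small example. A minimal sketch, assuming three made-up logits and a hypothetical helper `softmax_with_temperature`:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not taken from GPT-2
base_probs = softmax_with_temperature(toy_logits, 1.0)
sharp_probs = softmax_with_temperature(toy_logits, 0.6)

print([round(p, 3) for p in base_probs])
print([round(p, 3) for p in sharp_probs])
```

With a temperature below 1, the most likely token gains probability mass and the least likely token loses it, which is why the sampled text above reads as more coherent than unrestricted sampling typically does.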


In [28]:
# Top-k sampling
set_seed(42)

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,  # sample only from the 50 highest-probability tokens at each step
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, which is a little unusual in this part of our family. He's a friendly, calm kind of dog, and I've always wanted to have him around, and I always wanted to go with
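
Top-k filtering keeps the k most likely tokens and renormalizes their probabilities before sampling. A minimal sketch, assuming a made-up four-token distribution and a hypothetical helper `top_k_filter`:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Illustrative next-token distribution, not taken from GPT-2.
toy_probs = {"dog": 0.5, "cat": 0.3, "car": 0.15, "zoo": 0.05}
top2 = top_k_filter(toy_probs, k=2)
print(top2)
```

The cutoff is fixed at k regardless of the shape of the distribution, so a flat distribution loses plausible tokens and a peaked one keeps unlikely tokens.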


In [29]:
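# Top-p (nucleus) sampling
set_seed(42)

# Deactivate top-k (top_k=0) and sample only from the smallest set of tokens whose cumulative probability reaches top_p=0.92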
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92,
    top_k=0,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, a male, to hang out in the zoo," Salinsky said. "Even though I've been with him, we're in the same boat. People have not been able to separate the two
1: I enjoy walking with my cute dog and cooking hot dogs, and my favorite thing about American food is the foods we have all been given, which is why I have called our shops "Cracker Jack". And yes, these are American
2: I enjoy walking with my cute dog," Noro, 25, says. "I think there's lots of things I have learned about parenting since I was a young child, and now that I have a mom that also works there,
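
Top-p (nucleus) filtering chooses the cutoff from the cumulative probability instead of a fixed count. A minimal sketch, assuming made-up distributions and a hypothetical helper `top_p_filter`; the library's own implementation works on logits and may differ in edge cases:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus is complete
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Illustrative distributions, not taken from GPT-2.
flat_probs = {"dog": 0.5, "cat": 0.3, "car": 0.15, "zoo": 0.05}
peaked_probs = {"dog": 0.95, "cat": 0.03, "car": 0.02}

flat_nucleus = top_p_filter(flat_probs, top_p=0.92)
peaked_nucleus = top_p_filter(peaked_probs, top_p=0.92)
print(sorted(flat_nucleus), sorted(peaked_nucleus))
```

The nucleus grows when the distribution is flat and shrinks when it is peaked, which is the main difference from the fixed cutoff of top-k.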


In [32]:
# Install nbconvert if it is not already installed
!pip install nbconvert

# Fix the notebook file (clear its metadata in place)
!jupyter nbconvert --to notebook --inplace --ClearMetadataPreprocessor.enabled=True your_file.ipynb

This application is used to convert notebook files (*.ipynb)
        to various other formats.


Options
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePr