In [1]:
!pip install transformers torch



# Exercise 1: Recovering a Masked Token (Masked Language Modeling)

In [3]:
from transformers import pipeline

# 1. Load the "fill-mask" pipeline (no model is given, so the library picks its default)
mask_filler = pipeline("fill-mask")

# 2. Input sentence with the mask token; the default RoBERTa-style model expects <mask>, not [MASK]
input_sentence = "Hanoi is the <mask> of Vietnam."

# 3. Run the prediction and keep the five most likely tokens
predictions = mask_filler(input_sentence, top_k=5)

# 4. Print the results
print(f"Original sentence: {input_sentence}")
for pred in predictions:
    print(f"Prediction: '{pred['token_str']}' with confidence: {pred['score']:.4f}")
    print(f" Completed sentence: {pred['sequence']}")

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Original sentence: Hanoi is the <mask> of Vietnam.
Prediction: ' capital' with confidence: 0.9341
 Completed sentence: Hanoi is the capital of Vietnam.
Prediction: ' Republic' with confidence: 0.0300
 Completed sentence: Hanoi is the Republic of Vietnam.
Prediction: ' Capital' with confidence: 0.0105
 Completed sentence: Hanoi is the Capital of Vietnam.
Prediction: ' birthplace' with confidence: 0.0054
 Completed sentence: Hanoi is the birthplace of Vietnam.
Prediction: ' heart' with confidence: 0.0014
 Completed sentence: Hanoi is the heart of Vietnam.
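
The confidence values printed above are probabilities over the vocabulary for the masked position. As a rough sketch of how such a ranking can be produced, the snippet below applies a softmax to raw scores and keeps the top `k` entries. The vocabulary, the scores, and the helper name `top_k_predictions` are invented for illustration; they are not taken from the model or from the `transformers` API.

```python
import math

def top_k_predictions(logits, vocab, k=5):
    """Turn raw scores for one masked position into the k most likely tokens."""
    # Softmax, subtracting the maximum first for numerical stability
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens from most to least probable and keep the first k
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and scores, invented for illustration only
toy_vocab = [" capital", " Republic", " heart", " river"]
toy_logits = [6.0, 2.5, 0.5, -1.0]

for token, prob in top_k_predictions(toy_logits, toy_vocab, k=3):
    print(f"{token!r}: {prob:.4f}")
```

A real model scores tens of thousands of tokens at once, but the ranking step is the same idea: normalise the scores, sort, and truncate to `top_k`.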


# Exercise 2: Next Token Prediction

In [4]:
from transformers import pipeline

# 1. Load the "text-generation" pipeline (no model is given, so the library picks its default)
generator = pipeline("text-generation")

# 2. Prompt text
prompt = "The best thing about learning NLP is"

# 3. Generate text
# Note: max_length counts prompt plus generated tokens. In this run the default
# max_new_tokens (=256) took precedence, as the warning in the output reports.
generated_texts = generator(prompt, max_length=50, num_return_sequences=3)

# 4. Print the results
print(f"Prompt: '{prompt}'")
for i, text in enumerate(generated_texts):
    print(f"\n--- Generated text #{i+1} ---")
    print(text['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Prompt: 'The best thing about learning NLP is'

--- Generated text #1 ---
The best thing about learning NLP is that it allows you to know what to do when the time comes, and you can really learn from it. So, for some reason, I think it's really good to have a lot of good stuff in NLP."

According to the Guardian, he was inspired to write the book after reading some of the other books he had read in the past. "I had a really good time with Neil Gaiman," he said. "I think he made you look at things from a different perspective. When you're reading things from a different perspective, it's amazing that you can actually understand them from one perspective."

NLP has also helped him develop his writing style, as well as his writing style. "It's a little bit of a creative thing. I love to work on my writing and that's what I wanted to do," says the former writer. "I've always been a little bit worried about the amount of material I can put on the page, but now it's more time consumin
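
Text generation of this kind is autoregressive: the model predicts one token, appends it to the context, and repeats. The sketch below imitates that loop with a tiny lookup table in place of a neural network. `TOY_MODEL` and `generate` are hypothetical names invented for illustration, and the table only looks at the last token, whereas a real model conditions on the whole context. The `max_new_tokens` argument here mirrors the meaning of the `transformers` parameter of the same name: it limits only the tokens added after the prompt, while `max_length` limits prompt and generated tokens together.

```python
import random

# Hypothetical next-token table, invented for illustration; a real model
# would compute these probabilities from the whole preceding context.
TOY_MODEL = {
    "is": {"fun": 0.6, "hard": 0.4},
    "fun": {"and": 0.7, "<eos>": 0.3},
    "hard": {"and": 0.5, "<eos>": 0.5},
    "and": {"fun": 0.5, "hard": 0.5},
}

def generate(prompt_tokens, max_new_tokens, rng):
    """Sample up to max_new_tokens tokens, one at a time, after the prompt."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = TOY_MODEL.get(tokens[-1])
        if distribution is None:
            break  # the toy table has no continuation for this token
        choices, weights = zip(*distribution.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "<eos>":
            break  # end-of-sequence token stops generation early
        tokens.append(next_token)
    return tokens

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
prompt_tokens = ["NLP", "is"]
for i in range(3):
    print(f"#{i + 1}:", " ".join(generate(prompt_tokens, max_new_tokens=5, rng=rng)))
```

Because each step samples from a distribution, the three sequences can differ from each other, which is also why `num_return_sequences` can return distinct texts for a single prompt.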

# Exercise 3: Computing a Sentence Vector (Sentence Representation)
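
Mean pooling, the technique this exercise applies to BERT's token vectors, averages the vectors of the real tokens and skips padding positions by consulting the attention mask. The pure-Python sketch below shows the arithmetic on made-up numbers. The helper `mean_pool` and the toy vectors are assumptions for illustration only and are not part of `torch` or `transformers`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                sums[j] += value
    # Guard against division by zero when every position is masked out
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]

# Hypothetical 2-dimensional token vectors; the last row stands in for padding
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

print(mean_pool(toy_vectors, toy_mask))
```

Without the mask, the padding row would dominate the average. With a single input sentence there is no padding, so the mask matters mainly when sentences of different lengths are batched together.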

In [5]:
import torch
from transformers import AutoTokenizer, AutoModel

# 1. Choose a BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# 2. Input sentence
sentences = ["This is a sample sentence."]

# 3. Tokenize the sentence
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

# 4. Run the model to get the hidden states (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_state = outputs.last_hidden_state

# 5. Mean pooling: average the token vectors, ignoring padding positions
attention_mask = inputs['attention_mask']
mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()

# Sum the vectors of real tokens along the sequence dimension
sum_embeddings = torch.sum(last_hidden_state * mask_expanded, 1)

# Count the real tokens; clamp to avoid division by zero
sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)

sentence_embedding = sum_embeddings / sum_mask

# 6. Print the results
print("Sentence representation vector:")
print(sentence_embedding)
print("\nVector shape:", sentence_embedding.shape)

Sentence representation vector:
tensor([[-6.3874e-02, -4.2837e-01, -6.6779e-02, -3.8430e-01, -6.5784e-02,
         -2.1826e-01,  4.7636e-01,  4.8659e-01,  4.0647e-05, -7.4273e-02,
         -7.4740e-02, -4.7635e-01, -1.9773e-01,  2.4824e-01, -1.2162e-01,
          1.6678e-01,  2.1045e-01, -1.4576e-01,  1.2636e-01,  1.8635e-02,
          2.4640e-01,  5.7090e-01, -4.7014e-01,  1.3782e-01,  7.3650e-01,
         -3.3808e-01, -5.0331e-02, -1.6452e-01, -4.3517e-01, -1.2900e-01,
          1.6516e-01,  3.4004e-01, -1.4930e-01,  2.2422e-02, -1.0488e-01,
         -5.1916e-01,  3.2964e-01, -2.2162e-01, -3.4206e-01,  1.1993e-01,
         -7.0148e-01, -2.3126e-01,  1.1224e-01,  1.2550e-01, -2.5191e-01,
         -4.6374e-01, -2.7261e-02, -2.8415e-01, -9.9249e-02, -3.7017e-02,
         -8.9192e-01,  2.5005e-01,  1.5816e-01,  2.2701e-01, -2.8497e-01,
          4.5300e-01,  5.0945e-03, -7.9441e-01, -3.1008e-01, -1.7403e-01,
          4.3029e-01,  1.6816e-01,  1.0590e-01, -4.8987e-01,  3.1856e-01,
          3.