# Text Models with Transformers

This notebook demonstrates how to work with modern text models using the Transformers library. We'll cover:

1. Loading and using tokenizers
2. Understanding token IDs and decoding
3. Working with language models
4. Text generation with different parameters

Let's get started!

In [None]:
from transformers import AutoTokenizer

prompt = "It was a dark and stormy"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer(prompt).input_ids
print(input_ids)

In [None]:
# Let's look at each token individually (repr makes leading spaces visible)
for t in input_ids:
    print(t, "\t:", repr(tokenizer.decode(t)))

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

In [None]:
# Convert input to tensor format
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Get model outputs
outputs = model(input_ids)
outputs.logits.shape  # (batch_size, sequence_length, vocab_size)
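
To make the logits shape concrete, here is a minimal sketch that uses a small nested list as a stand-in for `outputs.logits`. The numbers and the tiny vocabulary size of 4 are invented for illustration and are not real model output. In a causal language model, the scores at position `i` are for the token that follows position `i`, so the last position holds the scores for the next new token.

```python
# Toy stand-in for outputs.logits with shape
# (batch_size=1, sequence_length=3, vocab_size=4). Values are invented.
toy_logits = [
    [
        [0.1, 2.0, -1.0, 0.3],  # scores for the token following position 0
        [1.5, 0.2, 0.0, -0.7],  # scores for the token following position 1
        [-0.4, 0.9, 3.1, 0.0],  # scores for the token following position 2
    ]
]

batch_size = len(toy_logits)
sequence_length = len(toy_logits[0])
vocab_size = len(toy_logits[0][0])
print((batch_size, sequence_length, vocab_size))  # (1, 3, 4)

# Plain-list equivalent of logits[0, -1]: first sequence, last position.
final_logits = toy_logits[0][-1]

# Index of the highest score, i.e. the ID a greedy decoder would pick next.
next_token_id = max(range(vocab_size), key=lambda i: final_logits[i])
print(next_token_id)  # 2
```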

In [None]:
# Get the most likely next token: take the logits at the last position
final_logits = model(input_ids).logits[0, -1]
next_token_id = final_logits.argmax()

# Decode the predicted token
print("Most likely next token:", tokenizer.decode(next_token_id))

In [None]:
import torch

# Get top 10 most likely next tokens
top10_logits = torch.topk(final_logits, 10)
print("\nTop 10 most likely next tokens:")
for index in top10_logits.indices:
    print(tokenizer.decode(index))
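
The top-10 listing shows which tokens rank highest, but not how much probability each one gets. As a rough sketch of the idea, the snippet below applies a softmax to a handful of made-up logits and prints the top 3 with their probabilities. The vocabulary, the logit values, and the helper name `softmax` are assumptions for illustration; on real tensors, `torch.softmax` together with `torch.topk` would play the same role.

```python
import heapq
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits, invented for illustration.
toy_vocab = [" night", " day", " evening", " morning", " and"]
toy_logits = [4.0, 1.0, 2.5, 0.5, -1.0]

probs = softmax(toy_logits)

# Indices of the 3 highest logits, highest first (similar in spirit to topk).
top3 = heapq.nlargest(3, range(len(toy_logits)), key=lambda i: toy_logits[i])
for i in top3:
    print(f"{toy_vocab[i]!r}\t{probs[i]:.3f}")
```

Because softmax preserves ordering, ranking by logits and ranking by probabilities give the same top-k tokens.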

In [None]:
# Generate text with different decoding strategies
print("\nBasic generation:")
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))

print("\nBeam search (more focused):")
beam_output = model.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)
print(tokenizer.decode(beam_output[0]))

print("\nHigh temperature (more random, less coherent):")
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    temperature=3.0,
    max_new_tokens=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0]))
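
To see why a high temperature makes sampled text more random, here is a minimal sketch with made-up logits. It assumes temperature is applied by dividing the logits by the temperature before the softmax, and then samples one token ID from the resulting distribution. The logit values and the helper name `softmax_with_temperature` are hypothetical, not part of the Transformers API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits, invented for illustration.
toy_logits = [4.0, 2.5, 1.0, 0.5]

# Lower temperatures sharpen the distribution; higher ones flatten it.
for temperature in (0.5, 1.0, 3.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample one token ID from the flattened distribution (seeded for repeatability).
rng = random.Random(0)
hot_probs = softmax_with_temperature(toy_logits, 3.0)
sampled_id = rng.choices(range(len(toy_logits)), weights=hot_probs, k=1)[0]
print("Sampled token ID:", sampled_id)
```

At temperature 3.0 the probabilities move closer together, so unlikely tokens are drawn much more often than at 1.0.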

In [None]:
import sys
!{sys.executable} -m pip install torch transformers --upgrade

# Text Model Tokenization Example

This notebook demonstrates how to use the Transformers library to tokenize text with the BERT tokenizer (`bert-base-uncased`).

## Setup Instructions
1. Make sure you're using the "Python (Text Models)" kernel
2. If packages are missing, run the setup cell below

## What we'll do:
1. Set up the required packages
2. Import the AutoTokenizer
3. Create a sample prompt
4. Tokenize the text and get input IDs

In [2]:
import sys
!{sys.executable} -m pip install transformers

Collecting transformers
  Using cached transformers-4.56.0-py3-none-any.whl.metadata (40 kB)
Collecting filelock (from transformers)
  Using cached filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.9.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.0-cp39-abi3-

In [None]:
import torch
import transformers
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")

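# Test with a simple tokenizer (use_fast=False selects the pure-Python implementation)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
# encode() returns token IDs and, by default, adds BERT's [CLS] and [SEP] special tokens
result = tokenizer.encode("Hello, world!")
print(f"Token IDs: {result}")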