Lab 4 practice report

Name: Lê Huỳnh Cao Dương

Assignment: Sequence representation for RNNs involves converting sequential data into a numerical format that the model can process.

In [1]:
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Detect device: MPS (Apple Silicon) > CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using Apple Silicon GPU (MPS).")
else:
    device = torch.device("cpu")
    print("Using CPU.")

print("PyTorch version:", torch.__version__)


Using Apple Silicon GPU (MPS).
PyTorch version: 2.8.0


Create a sample dataset (toy dataset)

In [2]:
texts = [
    "I love this movie",
    "This film is terrible",
    "What an amazing experience",
    "I hate this",
    "Best movie ever",
    "Worst film ever",
    "I enjoyed it a lot",
    "I did not like this movie at all"
]
# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0]

print("Number of samples:", len(texts))
print("Sample 0:", texts[0], "-> label", labels[0])
print("Sample last:", texts[-1], "-> label", labels[-1])


Number of samples: 8
Sample 0: I love this movie -> label 1
Sample last: I did not like this movie at all -> label 0


Tokenization & basic preprocessing

In [3]:
import re

def simple_tokenize(text):
    """
    - Lowercase
    - Remove characters that are not letters/numbers/whitespace
    - Split on whitespace
    """
    text = text.lower().strip()
    # keep only lowercase letters, digits, and whitespace; drop everything else
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = text.split()
    return tokens

# Apply to our toy dataset
tokenized_texts = [simple_tokenize(t) for t in texts]

# Print results for inspection
for i, (orig, toks) in enumerate(zip(texts, tokenized_texts)):
    print(f"{i:02d} | Orig: {orig}")
    print(f"   Tokens: {toks}")

00 | Orig: I love this movie
   Tokens: ['i', 'love', 'this', 'movie']
01 | Orig: This film is terrible
   Tokens: ['this', 'film', 'is', 'terrible']
02 | Orig: What an amazing experience
   Tokens: ['what', 'an', 'amazing', 'experience']
03 | Orig: I hate this
   Tokens: ['i', 'hate', 'this']
04 | Orig: Best movie ever
   Tokens: ['best', 'movie', 'ever']
05 | Orig: Worst film ever
   Tokens: ['worst', 'film', 'ever']
06 | Orig: I enjoyed it a lot
   Tokens: ['i', 'enjoyed', 'it', 'a', 'lot']
07 | Orig: I did not like this movie at all
   Tokens: ['i', 'did', 'not', 'like', 'this', 'movie', 'at', 'all']


re.sub(r"[^a-z0-9\s]", "", text): removes every character that is not a letter a-z, a digit 0-9, or whitespace; simple and effective for this English example.

.lower(): normalizes uppercase to lowercase so the same word in different cases is not split into separate tokens. It has to run before the regex, which keeps only lowercase a-z.

text.split(): splits on whitespace; simple and adequate for an English example.

Each sentence is printed next to its tokens so that errors are easy to spot (e.g. punctuation not removed, leftover uppercase, broken words...).
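
To see what the cleaning rule does to input with punctuation or mixed case, here is a small sketch. The function body repeats simple_tokenize from the cell above; the sample sentences are made up for illustration and are not part of the lab dataset.

```python
import re

def simple_tokenize(text):
    # Same steps as the cell above: lowercase, strip, drop non-alphanumerics, split
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

# Hypothetical inputs chosen to exercise punctuation, casing, and extra spaces
samples = ["Don't stop!", "GREAT movie, 10/10", "   spaced   out   "]
tokenized_samples = [simple_tokenize(s) for s in samples]

for s, toks in zip(samples, tokenized_samples):
    print(repr(s), "->", toks)
```

Note that removed characters are deleted rather than replaced by a space, so "don't" becomes "dont" and "10/10" collapses to "1010". Letters outside a-z (for example accented characters) are dropped as well, which is one reason this tokenizer only suits plain English text.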

Build vocabulary (stoi, itos) & show counts
Purpose:

Count token frequencies in the dataset.

Build the vocab with the special tokens <PAD> and <UNK>.

Create two dicts: stoi (string→index) and itos (index→string).

Print the vocab size, the most frequent tokens, and the first few vocab entries for inspection.

In [4]:
from collections import Counter

# tokenized_texts was created in Cell 3
all_tokens = [tok for sent in tokenized_texts for tok in sent]
vocab_counter = Counter(all_tokens)

# Special tokens
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"

# Build vocab: special tokens first, then tokens sorted by frequency desc then alphabetically
sorted_tokens = sorted(vocab_counter.items(), key=lambda x: (-x[1], x[0]))
vocab = [PAD_TOKEN, UNK_TOKEN] + [tok for tok, _ in sorted_tokens]

stoi = {tok: i for i, tok in enumerate(vocab)}
itos = {i: tok for tok, i in stoi.items()}

# Print summary
print("Total raw tokens:", len(all_tokens))
print("Unique tokens (vocab size without specials):", len(vocab) - 2)
print("Vocab size (with specials):", len(vocab))
print()
print("Top tokens (token:count):")
for tok, cnt in sorted_tokens[:10]:
    print(f"  {tok}: {cnt}")
print()
print("Some vocab entries (first 12):")
for i in range(min(12, len(vocab))):
    print(f"  idx {i:02d} -> {vocab[i]}")
print()
print("Index of PAD token:", stoi[PAD_TOKEN])
print("Index of UNK token:", stoi[UNK_TOKEN])


Total raw tokens: 34
Unique tokens (vocab size without specials): 24
Vocab size (with specials): 26

Top tokens (token:count):
  i: 4
  this: 4
  movie: 3
  ever: 2
  film: 2
  a: 1
  all: 1
  amazing: 1
  an: 1
  at: 1

Some vocab entries (first 12):
  idx 00 -> <PAD>
  idx 01 -> <UNK>
  idx 02 -> i
  idx 03 -> this
  idx 04 -> movie
  idx 05 -> ever
  idx 06 -> film
  idx 07 -> a
  idx 08 -> all
  idx 09 -> amazing
  idx 10 -> an
  idx 11 -> at

Index of PAD token: 0
Index of UNK token: 1


Text → Integer Encoding

In [5]:
UNK_IDX = stoi["<UNK>"]

def text_to_sequence(tokens, stoi):
    """
    Convert list of tokens to list of integer indices.
    If a token is not in vocab -> assign UNK index.
    """
    return [stoi.get(tok, UNK_IDX) for tok in tokens]

# Apply encoding to all sentences
sequences = [text_to_sequence(tokens, stoi) for tokens in tokenized_texts]

print("Integer-encoded sequences:")
for i, seq in enumerate(sequences):
    print(f"{i:02d}:", seq)


Integer-encoded sequences:
00: [2, 21, 3, 4]
01: [3, 6, 17, 23]
02: [24, 10, 9, 15]
03: [2, 16, 3]
04: [12, 4, 5]
05: [25, 6, 5]
06: [2, 14, 18, 7, 20]
07: [2, 13, 22, 19, 3, 4, 11, 8]
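
A quick way to sanity-check an encoding is to decode it back through itos. The sketch below uses a tiny hand-written vocabulary (hypothetical, not the 26-entry vocabulary built above) and a helper named sequence_to_text, which is an illustrative name that does not appear in the notebook. It also shows what happens to a token that was never seen when the vocabulary was built.

```python
# Tiny hypothetical vocabulary, only for illustration
vocab = ["<PAD>", "<UNK>", "i", "this", "movie", "love"]
stoi = {tok: i for i, tok in enumerate(vocab)}
itos = {i: tok for tok, i in stoi.items()}
UNK_IDX = stoi["<UNK>"]

def text_to_sequence(tokens, stoi):
    # Unknown tokens fall back to the <UNK> index
    return [stoi.get(tok, UNK_IDX) for tok in tokens]

def sequence_to_text(seq, itos):
    # Inverse mapping: indices back to token strings
    return [itos[i] for i in seq]

known = text_to_sequence(["i", "love", "this", "movie"], stoi)
unseen = text_to_sequence(["i", "adore", "this", "movie"], stoi)

print(known, "->", sequence_to_text(known, itos))
print(unseen, "->", sequence_to_text(unseen, itos))
```

The round trip is lossless only for in-vocabulary tokens: "adore" is encoded as index 1 and decodes to <UNK>, so the original word cannot be recovered.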


Padding & Truncation (handling sequence length)

In [6]:
from typing import List

PAD_IDX = stoi["<PAD>"]
MAX_LEN = 6  # chosen based on observed sentence lengths

def pad_sequences(seqs: List[List[int]], max_len:int, pad_idx:int=PAD_IDX):
    padded = []
    lengths = []  # store lengths before padding (clipped to max_len when truncated)
    for seq in seqs:
        if len(seq) >= max_len:
            padded.append(seq[:max_len])  # truncation
            lengths.append(max_len)
        else:
            padded.append(seq + [pad_idx] * (max_len - len(seq)))  # padding
            lengths.append(len(seq))
    return padded, lengths

padded_seqs, seq_lengths = pad_sequences(sequences, MAX_LEN, PAD_IDX)

print("Padded sequences:")
for i, seq in enumerate(padded_seqs):
    print(f"{i:02d}:", seq, " (orig length:", seq_lengths[i], ")")


Padded sequences:
00: [2, 21, 3, 4, 0, 0]  (orig length: 4 )
01: [3, 6, 17, 23, 0, 0]  (orig length: 4 )
02: [24, 10, 9, 15, 0, 0]  (orig length: 4 )
03: [2, 16, 3, 0, 0, 0]  (orig length: 3 )
04: [12, 4, 5, 0, 0, 0]  (orig length: 3 )
05: [25, 6, 5, 0, 0, 0]  (orig length: 3 )
06: [2, 14, 18, 7, 20, 0]  (orig length: 5 )
07: [2, 13, 22, 19, 3, 4]  (orig length: 6 )
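
Downstream code often needs to tell real tokens from padding. One option, sketched here with NumPy under the assumption that sequences are right-padded as above, is a boolean mask derived from the stored lengths. The two rows are sentences 00 and 03 from the output; the variable names are illustrative.

```python
import numpy as np

PAD_IDX = 0
padded = np.array([[2, 21, 3, 4, 0, 0],
                   [2, 16, 3, 0, 0, 0]])
lengths = np.array([4, 3])

# mask[i, j] is True where position j holds a real token of sentence i
positions = np.arange(padded.shape[1])
mask = positions[None, :] < lengths[:, None]

# With right padding, and PAD_IDX never used for a real token,
# comparing against PAD_IDX gives the same mask
same_as_pad_check = np.array_equal(mask, padded != PAD_IDX)

print(mask.astype(int))
print(same_as_pad_check)
```

Such a mask can be used, for example, to average only the non-padding positions of a sequence of hidden states.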


One-Hot Encoding (demo)

In [7]:
import numpy as np

vocab_size = len(vocab)

def one_hot_encode(padded_sequences, vocab_size):
    num_sentences = len(padded_sequences)
    seq_len = len(padded_sequences[0])
    one_hot = np.zeros((num_sentences, seq_len, vocab_size), dtype=np.float32)

    for i in range(num_sentences):
        for j in range(seq_len):
            token_idx = padded_sequences[i][j]
            one_hot[i, j, token_idx] = 1.0

    return one_hot

one_hot_result = one_hot_encode(padded_seqs, vocab_size)

print("One-hot encoded tensor shape:", one_hot_result.shape)
print("Example (sentence 0):")
print(one_hot_result[0])


One-hot encoded tensor shape: (8, 6, 26)
Example (sentence 0):
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]]
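
The nested loops above can also be written as a single indexing operation, and the same sketch helps connect one-hot vectors to the embedding layer used in the next section: multiplying a one-hot vector by a weight matrix selects one row of that matrix. The matrix W below is random and only stands in for a learned embedding table; embed_dim = 4 is an arbitrary small value chosen for readability.

```python
import numpy as np

vocab_size = 26
embed_dim = 4  # hypothetical small size, not the value used in the model
padded = np.array([[2, 21, 3, 4, 0, 0]])  # sentence 0 from the output above

# Vectorized one-hot: pick rows of the identity matrix by token index
one_hot = np.eye(vocab_size, dtype=np.float32)[padded]  # shape (1, 6, 26)

# Random matrix standing in for a learned embedding table
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

via_matmul = one_hot @ W  # (1, 6, 26) @ (26, 4) -> (1, 6, 4)
via_lookup = W[padded]    # direct row lookup, same shape

print(one_hot.shape, via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```

The lookup form gives the same result without materializing a tensor whose last dimension grows with the vocabulary, which is the practical reason to prefer an embedding layer over explicit one-hot inputs.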


Embedding Representation + RNN (LSTM)

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, padded_seqs, labels):
        self.X = torch.tensor(padded_seqs, dtype=torch.long)
        self.y = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = TextDataset(padded_seqs, labels)

# We use small batch size to inspect things clearly
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for xb, yb in dataloader:
    print("Batch X shape:", xb.shape)
    print("Batch y shape:", yb.shape)
    print(xb)
    print(yb)
    break


Batch X shape: torch.Size([2, 6])
Batch y shape: torch.Size([2])
tensor([[ 2, 21,  3,  4,  0,  0],
        [25,  6,  5,  0,  0,  0]])
tensor([1, 0])


Model definition

In [9]:
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        
        # 1) Embedding layer: learns a vector for each token
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        
        # 2) LSTM layer: learns sequential dependencies
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            batch_first=True
        )
        
        # 3) Fully-connected layer: classification
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x shape: (batch, seq_len)
        
        embedded = self.embedding(x)  
        # embedded shape: (batch, seq_len, embed_dim)
        
        output, (hidden, cell) = self.lstm(embedded)
        # hidden shape: (num_layers, batch, hidden_dim)
        
        # With a single layer, hidden[-1] is the final hidden state (after any <PAD> steps)
        last_hidden = hidden[-1]  # shape: (batch, hidden_dim)
        
        logits = self.fc(last_hidden)  # shape: (batch, output_dim)
        return logits

# Instantiate model
vocab_size = len(vocab)
embed_dim = 32       # can be raised to 50-300 for a more expressive model
hidden_dim = 64
output_dim = 2       # positive / negative
pad_idx = stoi["<PAD>"]

model = SentimentLSTM(vocab_size, embed_dim, hidden_dim, output_dim, pad_idx).to(device)
model


SentimentLSTM(
  (embedding): Embedding(26, 32, padding_idx=0)
  (lstm): LSTM(32, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=2, bias=True)
)
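
In SentimentLSTM.forward, the LSTM reads the padded batch as-is, so hidden[-1] is taken after the trailing <PAD> steps. A common alternative, which the notebook does not use, is to pack the batch with torch.nn.utils.rnn.pack_padded_sequence so the LSTM stops at each sentence's true length. The following is a minimal sketch under that assumption; the layer sizes are arbitrary and the two rows are sentences 00 and 03 from the padding output.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

embedding = nn.Embedding(26, 8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.tensor([[2, 21, 3, 4, 0, 0],
                  [2, 16, 3, 0, 0, 0]])
lengths = torch.tensor([4, 3])  # true lengths; kept on the CPU

embedded = embedding(x)
packed = pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False
)
_, (h_packed, _) = lstm(packed)

# Reference: run the second sentence alone, without its padding
_, (h_single, _) = lstm(embedded[1:2, :3])

matches = torch.allclose(h_packed[-1][1], h_single[-1][0], atol=1e-5)
print(h_packed.shape)
print(matches)
```

With packing, the final hidden state of each sentence matches what the LSTM would produce on the unpadded sentence. To use this in the model, forward would need the lengths as an extra argument, which the seq_lengths list returned by pad_sequences could supply.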

Training Loop

Loss function: CrossEntropyLoss

Optimizer: Adam

Train for 30 epochs (the dataset is small, so the model learns quickly)
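
For reference, the quantity that a cross-entropy loss computes from raw logits can be written out in a few lines of NumPy. This is an illustrative re-derivation, not the PyTorch implementation; the function name cross_entropy and the example logits are made up.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Numerically stable log-softmax: subtract the row-wise max first
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class and average over the batch
    return -log_probs[np.arange(len(targets)), targets].mean()

# Row 0: undecided logits; row 1: confident and correct
logits = np.array([[0.0, 0.0],
                   [2.0, -2.0]])
targets = np.array([1, 0])

loss = cross_entropy(logits, targets)
print(round(float(loss), 4))
```

nn.CrossEntropyLoss with its default reduction="mean" computes this same batch average. A two-class model that guesses uniformly scores ln 2 ≈ 0.693 per batch; since the training loop in this section adds up the batch losses over the 4 batches of an epoch (8 samples, batch_size=2), a printed value near 4 × 0.693 ≈ 2.77 corresponds to chance-level predictions.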

In [10]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

num_epochs = 30

model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    correct = 0
    total = 0
    
    for xb, yb in dataloader:
        xb = xb.to(device)
        yb = yb.to(device)
        
        optimizer.zero_grad()
        logits = model(xb)
        
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
        preds = logits.argmax(dim=1)
        correct += (preds == yb).sum().item()
        total += yb.size(0)
    
    acc = correct / total
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {epoch_loss:.4f} - Acc: {acc:.4f}")


Epoch 1/30 - Loss: 2.9493 - Acc: 0.3750
Epoch 2/30 - Loss: 2.4702 - Acc: 0.6250
Epoch 3/30 - Loss: 1.8880 - Acc: 0.8750
Epoch 4/30 - Loss: 1.0469 - Acc: 1.0000
Epoch 5/30 - Loss: 0.1586 - Acc: 1.0000
Epoch 6/30 - Loss: 0.0079 - Acc: 1.0000
Epoch 7/30 - Loss: 0.0017 - Acc: 1.0000
Epoch 8/30 - Loss: 0.0002 - Acc: 1.0000
Epoch 9/30 - Loss: 0.0001 - Acc: 1.0000
Epoch 10/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 11/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 12/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 13/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 14/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 15/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 16/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 17/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 18/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 19/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 20/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 21/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 22/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 23/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 24/30 - Loss: 0.0000 - Acc: 1.0000
Epoch 25/30 - Loss: 0.000

Predicting new sentences (Inference)

In [13]:
def predict_sentiment(model, sentence):
    model.eval()
    
    # 1) tokenize
    toks = simple_tokenize(sentence)
    
    # 2) convert to indices
    seq = text_to_sequence(toks, stoi)
    
    # 3) pad
    if len(seq) < MAX_LEN:
        seq = seq + [PAD_IDX] * (MAX_LEN - len(seq))
    else:
        seq = seq[:MAX_LEN]
        
    seq = torch.tensor([seq], dtype=torch.long).to(device)
    
    # 4) forward (no gradients are needed at inference time)
    with torch.no_grad():
        logits = model(seq)
    pred = logits.argmax(dim=1).item()
    
    return "Positive" if pred == 1 else "Negative"


In [14]:
print(predict_sentiment(model, "I really enjoyed this movie"))
print(predict_sentiment(model, "I hate this film"))
print(predict_sentiment(model, "not good at all"))
print(predict_sentiment(model, "this was amazing"))
print(predict_sentiment(model, "worst movie ever"))


Positive
Negative
Positive
Negative
Positive


Although the model reaches 100% accuracy on the training set, three of the five predictions on new sentences are wrong ("not good at all" -> Positive, "this was amazing" -> Negative, "worst movie ever" -> Positive), which points to overfitting on a very small dataset.
This highlights how much dataset size and the embedding representation matter in NLP tasks.
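
One way to look into these errors is to check which tokens of the new sentences fall outside the training vocabulary and are therefore encoded as <UNK>. The sketch below rebuilds the vocabulary from the eight training sentences and reuses the tokenizer from the notebook; the names train_vocab and oov_report are illustrative.

```python
import re

def simple_tokenize(text):
    # Same tokenizer as in the notebook
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

train_texts = [
    "I love this movie", "This film is terrible",
    "What an amazing experience", "I hate this",
    "Best movie ever", "Worst film ever",
    "I enjoyed it a lot", "I did not like this movie at all",
]
test_sentences = [
    "I really enjoyed this movie", "I hate this film",
    "not good at all", "this was amazing", "worst movie ever",
]

train_vocab = {tok for t in train_texts for tok in simple_tokenize(t)}

# Map each test sentence to its out-of-vocabulary tokens
oov_report = {}
for sent in test_sentences:
    toks = simple_tokenize(sent)
    oov = [t for t in toks if t not in train_vocab]
    oov_report[sent] = oov
    print(f"{sent!r}: OOV = {oov} ({len(oov)}/{len(toks)} tokens)")
```

In "not good at all", the word "good" is out of vocabulary, so the model never sees the token that carries the sentiment. "worst movie ever" contains no unknown tokens but is still misclassified, so missing vocabulary is only part of the picture; with eight training sentences the model may simply have memorized surface patterns.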