## A2 Language Modeling 

### Task 1: Dataset Acquisition 

I'm a big Star Wars fan. I searched the Internet and found this dataset on GitHub, which collects the character backgrounds and the storyline in a single source. One small problem is that everything is stored as HTML files, so I need to strip the HTML markup and extract only the text.

Data Source: https://github.com/AlbertoFormaggio1/star_wars_unstructured_dataset/tree/main

In [1]:
# import the libraries 
import re
from pathlib import Path 
from bs4 import BeautifulSoup
from tqdm import tqdm

In [2]:
# dataset path 
dataset_dir = Path("starwars_dataset")

In [3]:
def html2txt(html:str) -> str:

    soup = BeautifulSoup(html, 'lxml')

    # drop non-content elements (scripts, styles, navigation, page chrome)
    for tag in soup(['script', 'style', 'noscript', 'header', 'footer', 'nav', 'aside']):
        tag.decompose()

    
    # get the text 
    text = soup.get_text(separator=' ')

    # normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text 
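
To make the cleaning step concrete, here is a dependency-free sketch of the same idea built on the standard library's `html.parser` rather than BeautifulSoup. It is not the function used in this notebook; `TextExtractor`, `strip_html`, and the sample page are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping a fixed set of non-content elements."""

    # Same element names that html2txt removes; everything else is kept.
    SKIP = {"script", "style", "noscript", "header", "footer", "nav", "aside"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text fragments and collapse runs of whitespace.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Hypothetical page, not taken from the dataset.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><nav>Home</nav><h1>Abafar</h1><p>A desert   planet.</p></body></html>"
)
print(strip_html(sample))  # Abafar A desert planet.
```

The navigation text and the stylesheet disappear, and the repeated spaces collapse to one, which is the behaviour the BeautifulSoup version aims for on the real pages.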

In [4]:
# initialize empty list 
docs = []
meta_data = []

In [5]:
html_files = sorted(dataset_dir.glob('*.html'))
print("# of HTML files:", len(html_files))

# of HTML files: 242


In [6]:
for fp in tqdm(html_files):
    html = fp.read_text(encoding='utf-8', errors='ignore')
    text = html2txt(html)

    if len(text) < 200:
        continue 

    docs.append(text)
    meta_data.append(fp.name)


# no. of usable documents
print('Usable documents:', len(docs))
# sample text 
print('Sample text:', docs[0][:300])

100%|██████████| 242/242 [00:00<00:00, 1643.03it/s]

Usable documents: 151
Sample text: Abafar A desert planet located in the Outer Rim with a completely white surface. Known as The Void, the planet is barely populated but is home to massive amounts of rhydonium, a scarce and volatile fuel.





In [7]:
# form corpus 
corpus = '\n\n'.join(docs)
# define output path 
out_path = Path('starwars_dataset/starwars_corpus.txt')
out_path.write_text(corpus, encoding='utf-8')

print('Output:', out_path.resolve())
print('Total characters:', len(corpus))

Output: /Users/kaungheinhtet/Desktop/AIT_NLP_Assignments/A2_Language_Modeling/starwars_dataset/starwars_corpus.txt
Total characters: 614127


### Preprocessing

We now have the corpus. However, a computer does not understand words; it only works with numbers. Therefore, we need to tokenize and numericalize the corpus before feeding it into our language model.

In [8]:
# import required libraries 
import torch 
import torch.nn as nn
import torch.optim as optim 
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator



In [9]:
# !python -c "import torch; print('torch', torch.__version__)"
# !python -c "import importlib.metadata as m; print('torchtext', m.version('torchtext'))"
# !python -c "import platform; print(platform.platform()); import struct; print('py', struct.calcsize('P')*8, 'bit')"


Let's check whether we are using a GPU or the CPU, and set the random seed so that weight initialization is repeatable when the notebook is re-run.

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# make results comparable after a kernel restart
SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

cpu


In [11]:
# load the dataset
corpus_path = Path("starwars_dataset/starwars_corpus.txt")
text = corpus_path.read_text(encoding="utf-8")
print("Chars:", len(text))
print(text[:300])

Chars: 614127
Abafar A desert planet located in the Outer Rim with a completely white surface. Known as The Void, the planet is barely populated but is home to massive amounts of rhydonium, a scarce and volatile fuel.

Admiral Ackbar Fleet Admiral Gial Ackbar is a fictional character from the Star Wars franchise.


In [None]:
# use basic english tokenizer
# splits + lowercase + basic cleaning
tokenizer = get_tokenizer("basic_english")  

def yield_tokens(text):
    # tokenize by lines to avoid huge memory spikes
    for line in text.splitlines():
        toks = tokenizer(line)
        if toks:
            yield toks

specials = ["<pad>", "<unk>"]
vocab = build_vocab_from_iterator(yield_tokens(text), specials=specials, min_freq=1)
vocab.set_default_index(vocab["<unk>"])

In [None]:
# print the vocabulary size
vocab_size = len(vocab)
print("Vocab size:", vocab_size)
# sample tokens and token ids
print("Sample tokens:", tokenizer("Darth Vader is coming to Tatooine."))
print("Sample ids:", vocab(tokenizer("Darth Vader is coming to Tatooine.")))

Vocab size: 5751
Sample tokens: ['darth', 'vader', 'is', 'coming', 'to', 'tatooine', '.']
Sample ids: [81, 70, 10, 2004, 6, 212, 4]
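
The mapping from tokens to ids can be illustrated with a toy, pure-Python vocabulary. This is only a sketch of the idea: `build_toy_vocab` and `numericalize` are hypothetical helpers, whitespace splitting stands in for the `basic_english` tokenizer, and the ordering rule (most frequent first, ties alphabetical) is an assumption of this toy rather than a statement about torchtext internals.

```python
from collections import Counter


def build_toy_vocab(lines, specials=("<pad>", "<unk>"), min_freq=1):
    # Count lowercase, whitespace-separated tokens.
    counts = Counter(tok for line in lines for tok in line.lower().split())
    # Keep tokens that are frequent enough; order by count, then alphabetically.
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + kept          # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi


def numericalize(tokens, stoi):
    # Unknown tokens fall back to the <unk> id, like set_default_index does.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


lines = ["the force is strong", "the dark side of the force"]
itos, stoi = build_toy_vocab(lines)
print(itos)
print(numericalize("the force of yoda".split(), stoi))  # [2, 3, 6, 1]
```

Here `yoda` never appears in the toy corpus, so it is mapped to id 1, the `<unk>` token. Raising `min_freq` would push rare words into the same bucket.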


In [None]:
# numericalization
token_ids = []
for line in text.splitlines():
    toks = tokenizer(line)
    if toks:
        token_ids.extend(vocab(toks))

print("Total tokens:", len(token_ids))
print("First 30 ids:", token_ids[:30])

Total tokens: 117284
First 30 ids: [3234, 8, 640, 53, 969, 9, 2, 619, 509, 15, 8, 2005, 1719, 1056, 4, 180, 14, 2, 1, 3, 2, 53, 10, 1421, 3755, 33, 10, 294, 6, 2114]


In [None]:
# create LM sequences
def make_lm_data(token_ids, seq_len=30):
    xs, ys = [], []
    for i in range(0, len(token_ids) - seq_len):
        chunk = token_ids[i:i+seq_len+1]
        xs.append(chunk[:-1])
        ys.append(chunk[1:])
    return torch.tensor(xs, dtype=torch.long), torch.tensor(ys, dtype=torch.long)

SEQ_LEN = 30
X, Y = make_lm_data(token_ids, seq_len=SEQ_LEN)
print("X shape:", X.shape, "Y shape:", Y.shape)

X shape: torch.Size([117254, 30]) Y shape: torch.Size([117254, 30])
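
The shapes above follow from the sliding window: each start position yields one input of `seq_len` tokens and a target that is the same window shifted by one, so 117284 tokens with `seq_len=30` give 117284 - 30 = 117254 pairs. A small list-based sketch (with made-up ids and a hypothetical `make_windows` helper) shows the pattern:

```python
def make_windows(ids, seq_len):
    """Return (inputs, targets) where each target is its input shifted by one token."""
    xs, ys = [], []
    for i in range(len(ids) - seq_len):
        chunk = ids[i:i + seq_len + 1]  # seq_len + 1 tokens
        xs.append(chunk[:-1])           # first seq_len tokens
        ys.append(chunk[1:])            # last seq_len tokens
    return xs, ys


ids = list(range(10, 18))  # 8 made-up token ids: 10..17
xs, ys = make_windows(ids, seq_len=3)

print(len(xs))           # 5 windows = 8 - 3
print(xs[0], ys[0])      # [10, 11, 12] [11, 12, 13]
print(xs[-1], ys[-1])    # [14, 15, 16] [15, 16, 17]
```

Because the window advances one token at a time, neighbouring pairs share all but one token.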


In [16]:
from torch.utils.data import TensorDataset, DataLoader, random_split

dataset = TensorDataset(X, Y)

train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42))

BATCH_SIZE = 64
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE)
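
One caveat about this split: the windows from `make_lm_data` overlap heavily (stride 1), so a random split places near-identical sequences in both sets and the validation loss may look better than it would on unseen text. A possible alternative, shown here only as a sketch and not what this notebook does, is to cut the token stream first and build windows on each side. `split_then_window` is a hypothetical helper, and the integer ids stand in for token positions.

```python
def split_then_window(ids, seq_len, train_frac=0.9):
    """Split the token stream contiguously, then build (input, target) windows per side."""
    cut = int(train_frac * len(ids))
    train_ids, val_ids = ids[:cut], ids[cut:]

    def windows(seq):
        return [
            (seq[i:i + seq_len], seq[i + 1:i + seq_len + 1])
            for i in range(len(seq) - seq_len)
        ]

    return windows(train_ids), windows(val_ids)


ids = list(range(100))  # each id doubles as a position in the stream
train, val = split_then_window(ids, seq_len=5)

train_positions = {t for x, y in train for t in x + y}
val_positions = {t for x, y in val for t in x + y}

print(len(train), len(val))             # 85 5
print(train_positions & val_positions)  # set(), no shared positions
```

The resulting lists could be converted to tensors and wrapped in `TensorDataset` exactly as above; the cost is losing a handful of windows at the cut point.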


### Task 2: Model Training

In [None]:
class LSTMLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=vocab["<pad>"])
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)                 # [B, T] -> [B, T, E]
        out, hidden = self.lstm(x, hidden)    # [B, T, H]
        logits = self.fc(out)                 # [B, T, V]
        return logits, hidden

model = LSTMLM(vocab_size).to(device)
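
To get a feel for the model size, the parameter count can be worked out by hand. The sketch below assumes PyTorch's default `nn.LSTM` layout (four gates, an input-to-hidden and a hidden-to-hidden weight matrix plus two bias vectors per layer, no projection) and uses the vocabulary size of 5751 printed earlier; `count_lstm_lm_params` is a hypothetical helper.

```python
def count_lstm_lm_params(vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
    """Parameter count for embedding -> stacked LSTM -> linear, assuming default layouts."""
    embedding = vocab_size * emb_dim

    lstm = 0
    for layer in range(num_layers):
        in_dim = emb_dim if layer == 0 else hidden_dim
        weights = 4 * hidden_dim * (in_dim + hidden_dim)  # W_ih and W_hh for 4 gates
        biases = 2 * 4 * hidden_dim                       # b_ih and b_hh
        lstm += weights + biases

    fc = hidden_dim * vocab_size + vocab_size             # weight + bias

    return {"embedding": embedding, "lstm": lstm, "fc": fc,
            "total": embedding + lstm + fc}


counts = count_lstm_lm_params(vocab_size=5751)
for name, n in counts.items():
    print(f"{name:>9}: {n:,}")
```

Under these assumptions the total comes to about 3.1 million parameters, with the output layer being the largest single block. It can be cross-checked against the real model with `sum(p.numel() for p in model.parameters())`.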


In [None]:
criterion = nn.CrossEntropyLoss()  # targets are token ids
optimizer = optim.Adam(model.parameters(), lr=1e-3)

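def run_epoch(model, loader, train=True):
    # training mode enables dropout; evaluation mode disables it
    model.train(train)
    total_loss = 0.0

    # track gradients only when training; validation needs none
    with torch.set_grad_enabled(train):
        for x, y in loader:
            x, y = x.to(device), y.to(device)

            logits, _ = model(x)  # [B, T, V]
            # flatten to [B*T, V] and [B*T] so each position is one prediction
            loss = criterion(logits.view(-1, vocab_size), y.view(-1))

            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            total_loss += loss.item()

    # mean of the per-batch mean losses
    return total_loss / len(loader)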

EPOCHS = 10
for epoch in range(1, EPOCHS + 1):
    train_loss = run_epoch(model, train_loader, train=True)
    val_loss = run_epoch(model, val_loader, train=False)
    print(f"Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.4f}")


Epoch 1: train_loss=5.2681 val_loss=4.3624
Epoch 2: train_loss=3.8701 val_loss=3.2889
Epoch 3: train_loss=2.9474 val_loss=2.4152
Epoch 4: train_loss=2.2467 val_loss=1.7938
Epoch 5: train_loss=1.7592 val_loss=1.3612
Epoch 6: train_loss=1.4174 val_loss=1.0650
Epoch 7: train_loss=1.1721 val_loss=0.8573
Epoch 8: train_loss=0.9932 val_loss=0.7129
Epoch 9: train_loss=0.8628 val_loss=0.6176
Epoch 10: train_loss=0.7674 val_loss=0.5541
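
The losses are easier to read as perplexity, the exponential of the mean cross-entropy. The sketch below treats the printed values as mean per-token cross-entropy in nats (an approximation, since `run_epoch` averages batch means) and compares them with the loss of a uniform guess over the 5751-word vocabulary; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean cross-entropy (natural log units)."""
    return math.exp(mean_cross_entropy)


vocab_size = 5751
# Validation losses copied from the training log above.
val_losses = [4.3624, 3.2889, 2.4152, 1.7938, 1.3612,
              1.0650, 0.8573, 0.7129, 0.6176, 0.5541]

uniform_loss = math.log(vocab_size)  # loss of guessing every word equally
print(f"uniform guess: loss={uniform_loss:.3f} ppl={perplexity(uniform_loss):.0f}")

for epoch, loss in enumerate(val_losses, start=1):
    print(f"epoch {epoch:2d}: loss={loss:.4f} ppl={perplexity(loss):.2f}")
```

A perplexity of `k` roughly means the model is as uncertain as if it were choosing uniformly among `k` words at each step.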


In [None]:
import torch
import torch.nn.functional as F

@torch.no_grad()  # inference only, so no gradients are needed
def generate(model, prompt, max_new_tokens=50, temperature=1.0):
    model.eval()
    toks = tokenizer(prompt)
    ids = vocab(toks)
    x = torch.tensor([ids], dtype=torch.long).to(device)

    hidden = None
    for _ in range(max_new_tokens):
        # after the first step only the newest token is fed; `hidden` carries the context
        logits, hidden = model(x, hidden)
        # temperature < 1 sharpens the distribution, > 1 flattens it
        next_logits = logits[0, -1, :] / max(temperature, 1e-8)
        probs = F.softmax(next_logits, dim=-1)
        # sample the next token id from the predicted distribution
        next_id = torch.multinomial(probs, num_samples=1).item()

        ids.append(next_id)
        x = torch.tensor([[next_id]], dtype=torch.long).to(device)

    # decode (simple: join tokens with spaces)
    inv_vocab = vocab.get_itos()
    return " ".join(inv_vocab[i] for i in ids)

print(generate(model, "darth vader", max_new_tokens=40, temperature=0.9))


darth vader . edwards stated , the armorer replaces chewbacca , in operation , <unk> . the character has been voiced by jon favreau in the clone wars . bossk is a supporting character in the novel novel films , with an
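
The `temperature` argument divides the logits before the softmax, which changes how adventurous the sampling is. A small pure-Python sketch with made-up logits (`softmax_with_temperature` is a hypothetical helper, not part of the notebook) shows the effect:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token, giving safer but more repetitive text; higher temperatures spread it out, giving more varied but less coherent text.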


After training, the LSTM language model's parameters are saved using PyTorch’s state dictionary mechanism. The vocabulary mappings (token-to-index and index-to-token) are serialized separately in JSON format so that the deployment environment maps tokens to exactly the same ids as during training. These artifacts will later be reused in the web application for inference.

In [None]:
# artefacts folder 

artefact_dir = Path('artefacts')
artefact_dir.mkdir(exist_ok=True)

In [22]:
# save the model
model_path = artefact_dir / 'starwars_lstm_model.pt'
torch.save(model.state_dict(), model_path)

print("Model saved to:", model_path.resolve())

Model saved to: /Users/kaungheinhtet/Desktop/AIT_NLP_Assignments/A2_Language_Modeling/artefacts/starwars_lstm_model.pt


In [23]:
import json
# index -> token
itos = vocab.get_itos()

# token -> index 
stoi = {token: idx for idx, token in enumerate(itos)}

with open(artefact_dir / 'vocab.json', 'w', encoding='utf-8') as f:
    json.dump(
        {
            'itos': itos,
            'stoi': stoi
        }, f, ensure_ascii=False, indent=2
    )

print('Vocab saved')

Vocab saved
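
On the inference side, the saved JSON can be read back without torchtext. The sketch below assumes the `{'itos': ..., 'stoi': ...}` layout written above; `load_vocab`, `encode`, and `decode` are hypothetical helpers, and the round trip uses a tiny made-up vocabulary in a temporary folder rather than the real `artefacts/vocab.json`.

```python
import json
import tempfile
from pathlib import Path


def load_vocab(path):
    """Read the index->token list and token->index dict from the JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["itos"], data["stoi"]


def encode(tokens, stoi, unk_token="<unk>"):
    # Tokens missing from the vocabulary map to the <unk> id.
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]


def decode(ids, itos):
    return " ".join(itos[i] for i in ids)


# Round trip with a toy vocabulary (not the real one).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab.json"
    toy_itos = ["<pad>", "<unk>", "darth", "vader"]
    toy_stoi = {tok: i for i, tok in enumerate(toy_itos)}
    path.write_text(json.dumps({"itos": toy_itos, "stoi": toy_stoi}), encoding="utf-8")
    itos, stoi = load_vocab(path)

ids = encode(["darth", "vader", "returns"], stoi)
print(ids)                # [2, 3, 1]
print(decode(ids, itos))  # darth vader <unk>
```

The weights would be restored separately, for example with `model.load_state_dict(torch.load(model_path, map_location=device))` on a freshly constructed `LSTMLM` of the same size. The web application would also need to apply the same `basic_english` tokenization before calling `encode`.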
