In [3]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

**bold text**
# Language Modeling with ``nn.Transformer`` and torchtext

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

This is a tutorial on training a sequence-to-sequence model that uses the
[nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)_ module.

The PyTorch 1.2 release includes a standard transformer module based on the
paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf)_.
Compared to Recurrent Neural Networks (RNNs), the transformer model has proven
to be superior in quality for many sequence-to-sequence tasks while being more
parallelizable. The ``nn.Transformer`` module relies entirely on an attention
mechanism (implemented as
[nn.MultiheadAttention](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html)_)
to draw global dependencies between input and output. The ``nn.Transformer``
module is highly modularized such that a single component (e.g.,
[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)_)
can be easily adapted/composed.

<img src="file://../_static/img/transformer_architecture.jpg">


## Define the model




In [4]:
%%capture installs
# %pip install torchdata
# %pip install 'portalocker>=2.0.0'
%pip install polars

In this tutorial, we train a ``nn.TransformerEncoder`` model on a
language modeling task. The language modeling task is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words. A sequence of tokens are passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the word (see the next paragraph for more details). The
``nn.TransformerEncoder`` consists of multiple layers of
[nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html)_.
Along with the input sequence, a square attention mask is required because the
self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
the earlier positions in the sequence. For the language modeling task, any
tokens on the future positions should be masked. To produce a probability
distribution over output words, the output of the ``nn.TransformerEncoder``
model is passed through a linear layer followed by a log-softmax function.




In [5]:
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import re
from numbers import Number

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import torch
import polars as pl
import numpy as np

from torch.utils.data import DataLoader, Dataset

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset



In [6]:
torch.__version__

'2.0.0'

In [7]:
weather = pl.read_parquet("~/Hephaestus/data/weather_clean.parquet")
weather.head()


x,y,station_name,climate_identifier,province_code,local_year,local_month,local_day,local_hour,temp,temp_flag,dew_point_temp,dew_point_temp_flag,humidex,precip_amount,precip_amount_flag,relative_humidity,relative_humidity_flag,station_pressure,station_pressure_flag,wind_chill,wind_direction,wind_direction_flag,wind_speed,wind_speed_flag
f64,f64,str,str,str,str,str,str,str,f64,str,f64,str,f64,f64,str,f64,str,f64,str,f64,f64,str,f64,str
-114.000297,51.109447,"""CALGARY INT'L …","""3031094""","""AB""","""2010""","""1""","""1""","""0""",-21.6,"""missing""",-23.9,"""missing""",,,"""missing""",82.0,"""missing""",89.38,"""missing""",,,"""M""",,"""M"""
-114.000297,51.109447,"""CALGARY INT'L …","""3031094""","""AB""","""2010""","""1""","""1""","""1""",-21.2,"""missing""",-23.5,"""missing""",,,"""missing""",82.0,"""missing""",89.25,"""missing""",,,"""M""",,"""M"""
-114.000297,51.109447,"""CALGARY INT'L …","""3031094""","""AB""","""2010""","""1""","""1""","""2""",-20.8,"""missing""",-23.0,"""missing""",,,"""missing""",82.0,"""missing""",89.21,"""missing""",,,"""M""",,"""M"""
-114.000297,51.109447,"""CALGARY INT'L …","""3031094""","""AB""","""2010""","""1""","""1""","""3""",-20.4,"""missing""",-22.6,"""missing""",,,"""missing""",83.0,"""missing""",89.12,"""missing""",,,"""M""",,"""M"""
-114.000297,51.109447,"""CALGARY INT'L …","""3031094""","""AB""","""2010""","""1""","""1""","""4""",-20.4,"""missing""",-22.7,"""missing""",,,"""missing""",82.0,"""missing""",89.04,"""missing""",,,"""M""",,"""M"""


``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.




In [8]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [9]:
def scale_numeric(df):
    for col in df.columns:
        if df[col].dtype == pl.Float64 or df[col].dtype == pl.Int64:
            df = df.with_columns(
                ((pl.col(col) - pl.col(col).mean()) / pl.col(col).std()).alias(col)
            )  # .select(pl.col(["dew_point_temp", "NewCOL"]))
    return df


weather = scale_numeric(weather)


In [10]:
def make_lower_remove_special_chars(df):
    df = df.with_columns(
        pl.col(pl.Utf8).str.to_lowercase().str.replace_all("[^a-zA-Z0-9]", " ")
    )
    return df


weather = make_lower_remove_special_chars(weather)

In [11]:
def get_unique_utf8_values(df):
    arr = np.array([])
    for col in df.select(pl.col(pl.Utf8)).columns:
        arr = np.append(arr, df[col].unique().to_numpy())

    return np.unique(arr)


weather_val_tokens = get_unique_utf8_values(weather)

In [12]:
def get_col_tokens(df):
    tokens = []
    for col_name in df.columns:
        sub_strs = re.split(r"[^a-zA-Z0-9]", col_name)
        tokens.extend(sub_strs)
    return np.unique(np.array(tokens))


weather_col_tokens = get_col_tokens(weather)

In [13]:
special_tokens = np.array(
    [
        "missing",
        "<batch-start>",
        "<batch-end>",
        "<pad>",
        "<unk>",
        ":",
        ",",
        "<row-start>",
        "<row-end>",
    ]
)
tokens = np.unique(
    np.concatenate(
        (
            weather_val_tokens,
            weather_col_tokens,
            special_tokens,
        )
    )
)
tokens


array([',', '0', '1', '10', '11', '12', '13', '14', '15', '16', '17',
       '18', '19', '2', '20', '2010', '2011', '2012', '2013', '2014',
       '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30',
       '3012206', '3026knq', '3031094', '3033890', '3035208', '3062696',
       '31', '4', '5', '6', '7', '8', '9', ':', '<batch-end>',
       '<batch-start>', '<pad>', '<row-end>', '<row-start>', '<unk>',
       'ab', 'amount', 'calgary int l cs', 'chill', 'climate', 'code',
       'day', 'dew', 'direction', 'edmonton international cs', 'flag',
       'fort mcmurray cs', 'hour', 'humidex', 'humidity', 'identifier',
       'lethbridge cda', 'local', 'm', 'missing', 'month', 'name',
       'pincher creek climate', 'point', 'precip', 'pressure', 'province',
       'relative', 'speed', 'station', 'sundre a', 'temp', 'wind', 'x',
       'y', 'year'], dtype=object)

In [14]:

class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = StringNumericEmbedding(d_model, embed_dim=ntoken)
        self.d_model = d_model

        self.decoder = nn.Linear(d_model, ntoken)
        self.numeric_decoder = nn.Linear(ntoken, 1)


        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.encoder.embedding.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)
        # self.numeric_decoder.data.uniform(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor) -> Tensor:
        """
        Arguments:
            src: Tensor, shape ``[seq_len, batch_size]``
            src_mask: Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[seq_len, batch_size, ntoken]``
        """

        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        numeric_output = self.numeric_decoder(output)
        output = self.decoder(output)

        return output, numeric_output


def generate_square_subsequent_mask(sz: int) -> Tensor:
    """Generates an upper-triangular matrix of ``-inf``, with zeros on ``diag``."""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)

In [15]:
@dataclass
class StringNumeric:
    value: Union[str, float]
    # all_tokens: np.array
    is_numeric: bool = field(default=None, repr=True)
    embedding_idx: int = field(default=None, repr=True)

    def __post_init__(self):
        if isinstance(self.value, str):
            self.is_numeric = False
        else:
            self.is_numeric = True
            self.embedding_idx = 0

    def gen_embed_idx(self, tokens: np.array):
        if not self.is_numeric:
            try:
                self.embedding_idx = np.where(tokens == self.value)[0][0] + 1
            except IndexError:
                self.embedding_idx = np.where(tokens == "<unk>")[0][0] + 1


x = StringNumeric(value="climate")
# xx = StringNumeric(value="climate", tokens=tokens)
print(x)
y = StringNumeric(value=1.0)
print(y)
z = StringNumeric(value="SomeRandomString")
print(z)
x.gen_embed_idx(tokens)
print(x)
# print(StringNumeric(value=1.0, all_tokens=tokens))


StringNumeric(value='climate', is_numeric=False, embedding_idx=None)
StringNumeric(value=1.0, is_numeric=True, embedding_idx=0)
StringNumeric(value='SomeRandomString', is_numeric=False, embedding_idx=None)
StringNumeric(value='climate', is_numeric=False, embedding_idx=64)


In [16]:
class TabularDataset(Dataset):
    # def __init__(self, df: pl.DataFrame, vocab_dict: Dict, m_dim: int) -> Dataset:
    def __init__(
        self,
        df: pl.DataFrame,
        vocab,
        shuffle_cols=False,
        n_rows=None,
        max_seq_length=700,
    ) -> Dataset:
        self.df = df
        self.vocab = vocab
        self.vocab_len = vocab.shape[0]
        self.shuffle_cols = shuffle_cols
        self.n_rows = n_rows
        self.max_seq_length = max_seq_length
        # self.vocab_dict = vocab_dict
        # self.embedding = nn.Embedding(len(self.string_vocab), m_dim)
        # Numeric Scale

        # self.col_vocab = self.df.columns

    def __len__(self):
        """Returns the number of sequences in the dataset."""
        length = self.df.shape[0] // self.n_rows
        return length

    def __getitem__(self, idx):
        """Returns a tuple of (input, target) at the given index."""
        batch = self.batch(idx)
        start = StringNumeric("<batch-start>")
        start.gen_embed_idx(self.vocab)
        end = StringNumeric("<batch-end>")
        end.gen_embed_idx(self.vocab)
        batch = self.padder(batch)
        batch = [start] + batch + [end]
        return batch

    def batch(self, idx):
        """Returns a batch from splitter from the starting index to the start
        index + n_rows"""
        batch = []
        for i in range(idx, idx + self.n_rows):
            row = self.df[i]
            row = self.splitter(row)
            batch.extend(row)

        return batch

    def padder(self, batch: List[StringNumeric]):
        diff = self.max_seq_length - len(batch)  # -2 for start and end
        if diff > 0:
            pad = StringNumeric("<pad>")
            pad.gen_embed_idx(self.vocab)
            batch.extend([pad] * diff)
        elif diff < 0:
            batch = batch[: self.max_seq_length - 1]
            # add warning
            new_end = StringNumeric("<batch-end>")
            new_end.gen_embed_idx(self.vocab)
            batch.append(new_end)
            print("Batch too long, truncating")
            Warning("Batch too long, truncating")
        return batch

    def splitter(self, row: pl.DataFrame) -> List[Union[str, float, None]]:
        vals = ["<row-start>"]
        cols = row.columns
        if self.shuffle_cols:
            np.random.shuffle(cols)

        for col in cols:
            value = row[col][0]
            col = col.split("_")
            vals.extend(col)
            vals.append(":")
            if isinstance(value, Number):
                vals.append(value)
            elif value is None:
                vals.append("missing")
                # Nones are only for numeric columns, others are "None"
            elif isinstance(value, str):
                vals.extend(value.split(" "))
            else:
                raise ValueError("Unknown type")
            vals.append(",")
        vals.append("<row-end>")

        vals = [StringNumeric(value=val) for val in vals]
        for val in vals:
            val.gen_embed_idx(self.vocab)

        return vals


weather_ds = TabularDataset(weather, tokens, shuffle_cols=False, n_rows=5)
print(weather_ds[0][:10])
print(weather_ds[0][-10:-1])


[StringNumeric(value='<batch-start>', is_numeric=False, embedding_idx=55), StringNumeric(value='<row-start>', is_numeric=False, embedding_idx=58), StringNumeric(value='x', is_numeric=False, embedding_idx=93), StringNumeric(value=':', is_numeric=False, embedding_idx=53), StringNumeric(value=-0.551099305737714, is_numeric=True, embedding_idx=0), StringNumeric(value=',', is_numeric=False, embedding_idx=1), StringNumeric(value='y', is_numeric=False, embedding_idx=94), StringNumeric(value=':', is_numeric=False, embedding_idx=53), StringNumeric(value=-0.37817406811183396, is_numeric=True, embedding_idx=0), StringNumeric(value=',', is_numeric=False, embedding_idx=1)]
[StringNumeric(value='<pad>', is_numeric=False, embedding_idx=56), StringNumeric(value='<pad>', is_numeric=False, embedding_idx=56), StringNumeric(value='<pad>', is_numeric=False, embedding_idx=56), StringNumeric(value='<pad>', is_numeric=False, embedding_idx=56), StringNumeric(value='<pad>', is_numeric=False, embedding_idx=56), 

In [17]:
print(len(weather_ds), len(weather_ds[0]), len(weather_ds[1]),len(weather_ds[2]))

127132 702 702 702


In [18]:
test_embed = []
for i in range(10):
    test_embed.append(weather_ds[i])
test_embed[0][0:10]

[StringNumeric(value='<batch-start>', is_numeric=False, embedding_idx=55),
 StringNumeric(value='<row-start>', is_numeric=False, embedding_idx=58),
 StringNumeric(value='x', is_numeric=False, embedding_idx=93),
 StringNumeric(value=':', is_numeric=False, embedding_idx=53),
 StringNumeric(value=-0.551099305737714, is_numeric=True, embedding_idx=0),
 StringNumeric(value=',', is_numeric=False, embedding_idx=1),
 StringNumeric(value='y', is_numeric=False, embedding_idx=94),
 StringNumeric(value=':', is_numeric=False, embedding_idx=53),
 StringNumeric(value=-0.37817406811183396, is_numeric=True, embedding_idx=0),
 StringNumeric(value=',', is_numeric=False, embedding_idx=1)]

In [19]:
# embedding = nn.Embedding(vocab_len, embed_dim, padding_idx=0)

In [20]:
class StringNumericEmbedding(nn.Module):
    def __init__(self, vocab_len: int, embed_dim: int=28):
        super().__init__()
        self.embedding = nn.Embedding(vocab_len, embed_dim, padding_idx=0)

    def forward(self, input: StringNumeric):
        embedding_index = torch.tensor([i.embedding_idx for i in input])
        embed = self.embedding(embedding_index)
        with torch.no_grad():
            for idx, value in enumerate(input):
                if value.is_numeric:
                    embed[idx][0] = value.value
        return embed

my_embed = StringNumericEmbedding(weather_ds.vocab_len+1)

t = my_embed(weather_ds[0])


## Load and batch data




In [21]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


The model hyperparameters are defined below. The ``vocab`` size is
equal to the length of the vocab object.




In [22]:
ntokens = len(weather_ds.vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

## Run the model




We use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)_
with the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html)_
(stochastic gradient descent) optimizer. The learning rate is initially set to
5.0 and follows a [StepLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html)_
schedule. During training, we use [nn.utils.clip_grad_norm\_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html)_
to prevent gradients from exploding.




In [37]:
len(weather_ds), weather.shape

(127132, (635664, 25))

In [38]:
import copy
import time

criterion = nn.CrossEntropyLoss()

def custom_loss(class_preds, numeric_preds,class_target, numeric_target, raw_data):
    cross_entropy = nn.CrossEntropyLoss()
    mse_loss = nn.MSELoss()

    class_loss = cross_entropy(class_preds, class_target)
    actual_num_idx = nn.Tensor([idx for idx, j in enumerate(raw_data) if j.is_numeric])
    pred_nums = numeric_preds[actual_num_idx]
    actual_nums = nn.Tensor([i.value for i in raw_data if i.is_numeric])
    reg_loss = mse_loss(pred_nums, actual_nums)

    return reg_loss + class_loss

lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    for batch, i in enumerate(range(len(weather_ds))):
        seq_len = data.size(0)
        if seq_len != bptt:  # only on last batch
            src_mask = src_mask[:seq_len, :seq_len]
        class_output, numeric_output = model(data, src_mask)
        flat_class_output = class_output.view(-1, ntokens)
        flat_numeric_output = numeric_output.view(-1, ntokens)
        loss = criterion(output.view(-1, ntokens), targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    src_mask = generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(0)
            if seq_len != bptt:
                src_mask = src_mask[:seq_len, :seq_len]
            output = model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += seq_len * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. Adjust the learning rate after each epoch.



In [None]:
best_val_loss = float('inf')
epochs = 3

with TemporaryDirectory() as tempdir:
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")

    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train(model)
        val_loss = evaluate(model, val_data)
        val_ppl = math.exp(val_loss)
        elapsed = time.time() - epoch_start_time
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

        scheduler.step()
    model.load_state_dict(torch.load(best_model_params_path)) # load best model states

| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 824.53 | loss  8.22 | ppl  3723.21
| epoch   1 |   400/ 2928 batches | lr 5.00 | ms/batch 800.76 | loss  6.93 | ppl  1020.40
| epoch   1 |   600/ 2928 batches | lr 5.00 | ms/batch 850.47 | loss  6.47 | ppl   642.27
| epoch   1 |   800/ 2928 batches | lr 5.00 | ms/batch 823.83 | loss  6.31 | ppl   552.42
| epoch   1 |  1000/ 2928 batches | lr 5.00 | ms/batch 820.16 | loss  6.20 | ppl   495.21
| epoch   1 |  1200/ 2928 batches | lr 5.00 | ms/batch 823.57 | loss  6.17 | ppl   476.97
| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch 879.57 | loss  6.12 | ppl   455.25
| epoch   1 |  1600/ 2928 batches | lr 5.00 | ms/batch 831.05 | loss  6.11 | ppl   450.69
| epoch   1 |  1800/ 2928 batches | lr 5.00 | ms/batch 838.54 | loss  6.03 | ppl   415.73


KeyboardInterrupt: ignored

## Evaluate the best model on the test dataset




In [None]:
test_loss = evaluate(model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

In [None]:
class StringNumericEmbedding(nn.Embedding):
    def __init__(self, string_numeric:StringNumeric, embed_dim: int=28):
