<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/08_LanguageModels/NeuralLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Language Modeling with NNs

We will use `pytorch` for the model and the HF `datasets` library for the corpus.

In [None]:
%%capture
!pip install datasets==2.19.0 watermark

In [None]:
%%capture
!python -m spacy download en_core_web_sm

In [None]:
%load_ext watermark

In [None]:
%watermark -vp datasets,torch,nltk,spacy

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

datasets: 2.19.0
torch   : 2.2.1+cu121
nltk    : 3.8.1
spacy   : 3.7.4



In [None]:
import re

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import Tensor
from datasets import load_dataset
from nltk.lm.preprocessing import pad_both_ends
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab
from torch.utils.data import DataLoader

## Data

We will use the Yelp reviews corpus for illustration only. Each document, with all its attributes (text, tags, etc.), is an "example".

Read the [very short HF tutorial on `datasets`](https://huggingface.co/docs/datasets/tutorial).

In [None]:
dataset = load_dataset("yelp_review_full")


In [None]:
# inspect the structure:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})


In [None]:
# look at one arbitrary review (index 33):
dataset["train"][33]

{'label': 2,
 'text': 'If you want a true understanding of Pittsburgh in the morning, come here. This greasy spoon is always packed, and is one of the better of its kind south of the city.\\n\\nThey serve waffles in halves, which is great. The eggs and toast are good, the homemade hot sausage is excellent. The drawback are the barely cooked potatoes.\\n\\nIf you\'re hungry, get \\"The Mixed Grill\\"... Gab and Eat\'s brand of the \\"kitchen sink\\" breakfast that all Midwest places are about.'}

In [None]:
# shrink it to work faster: 5k train, 5k test
dataset["train"] = dataset["train"].select(range(0, 5_000))
dataset["test"] = dataset["test"].select(range(0, 5_000))

In [None]:
# from here on we work only with the texts and no longer need dataset
texts_train = dataset["train"]["text"]
texts_test = dataset["test"]["text"]
# del dataset

In [None]:
texts_train[0]

"dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

## Tokenization

We will use the English tokenizer from `spacy` (instantiated through `torchtext`).

The goal is to build a **list of trigrams to train the NN** (2 words of context/history and 1 target). We will:

1. Build a vocab from the tokenizer output: the vocab is the set of tokens our model recognizes.

    * We use `torchtext` instead of `nltk` because it makes the mapping from a token to a token_id more convenient.

    * We need to pad with BOS and EOS tokens, and we use min freq = 2 (tokens seen fewer than 2 times are mapped to `UNK`).

2. Tokenize each doc and convert it to token ids according to the vocab.

3. Go from tokens to trigrams and build a single list with all the training samples.
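The three steps above can be sketched without any library. This is only an illustration: it assumes a naive whitespace tokenizer instead of spaCy, and `pad`, `build_vocab` and `to_ngrams` are hypothetical helper names, not `torchtext` or `nltk` functions. The windowing in `to_ngrams` mirrors the indexing used by `doc2ngrams` in this notebook.

```python
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<unk>"

def pad(tokens, n=3):
    """Add n-1 BOS tokens before and n-1 EOS tokens after the sequence."""
    return [BOS] * (n - 1) + tokens + [EOS] * (n - 1)

def build_vocab(docs, min_freq=2):
    """Map tokens seen at least min_freq times to ids; id 0 is reserved for UNK."""
    counts = Counter(tok for doc in docs for tok in doc)
    itos = [UNK] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}

def to_ngrams(ids, n=3):
    """Slide a window over the ids: n-1 context ids and 1 target id."""
    return [(tuple(ids[i - n:i - 1]), ids[i - 1]) for i in range(n, len(ids))]

# toy corpus (made up): whitespace split stands in for the spaCy tokenizer
docs = [pad(d.split()) for d in ["the food is good", "the food is bad"]]
stoi = build_vocab(docs)
ids = [stoi.get(tok, stoi[UNK]) for tok in docs[0]]  # rare tokens fall back to UNK
trigrams = to_ngrams(ids)
print(ids)
print(trigrams)
```

In this toy corpus "good" and "bad" appear only once, so both end up as id 0 (`<unk>`), which is the same effect `min_freq=2` has on the real vocab.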


In [None]:
# default English tokenizer with rules for punctuation, contractions, etc.:
tokenizer = get_tokenizer('spacy')



In [None]:
def tokenize_doc(doc: str, ngram_order: int = 3) -> list:
  """Convert a document to a list of tokens. We use this fn to build the vocab.
  NOTE here BOS and EOS are beginning-of-sequence and end-of-sequence tokens.
  We would need a sentence tokenizer to mark beginning-of-sentence and end-of-sentence instead.
  """
  # collapse every run of whitespace into a single space
  # NOTE additional cleaning operations can be added here
  text = re.sub(r'\s+', ' ', doc)
  res = list(pad_both_ends(tokenizer(text), n=ngram_order))
  return res

def doc2tensor(doc: str, vocab: Vocab, ngram_order: int = 3) -> Tensor:
    """Convert a document to a flat Tensor of vocab token ids
    """
    tokens = tokenize_doc(doc, ngram_order=ngram_order)
    idxs = vocab(tokens)
    res = torch.tensor(idxs, dtype=torch.long)
    return res

def doc2ngrams(doc: str, vocab: Vocab, ngram_order: int = 3) -> list:
  """Convert a document into tuples of
  ([ idx_i-context_size, ..., idx_i-1 ], target_idx)
  """
  tokens = doc2tensor(doc, vocab, ngram_order=ngram_order)
  ngrams = [
      (tokens[(i-ngram_order):(i-1)], tokens[i-1])
      for i in range(ngram_order, len(tokens))
  ]
  return ngrams

In [None]:
# for example:
print(texts_train[33])
print(tokenize_doc(texts_train[33]))
# the cleaning could be improved a lot (e.g. there are literal "\\n" sequences that were not parsed as newlines)

If you want a true understanding of Pittsburgh in the morning, come here. This greasy spoon is always packed, and is one of the better of its kind south of the city.\n\nThey serve waffles in halves, which is great. The eggs and toast are good, the homemade hot sausage is excellent. The drawback are the barely cooked potatoes.\n\nIf you're hungry, get \"The Mixed Grill\"... Gab and Eat's brand of the \"kitchen sink\" breakfast that all Midwest places are about.
['<s>', '<s>', 'If', 'you', 'want', 'a', 'true', 'understanding', 'of', 'Pittsburgh', 'in', 'the', 'morning', ',', 'come', 'here', '.', 'This', 'greasy', 'spoon', 'is', 'always', 'packed', ',', 'and', 'is', 'one', 'of', 'the', 'better', 'of', 'its', 'kind', 'south', 'of', 'the', 'city.\\n\\nThey', 'serve', 'waffles', 'in', 'halves', ',', 'which', 'is', 'great', '.', 'The', 'eggs', 'and', 'toast', 'are', 'good', ',', 'the', 'homemade', 'hot', 'sausage', 'is', 'excellent', '.', 'The', 'drawback', 'are', 'the', 'barely', 'cooked', '
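One possible extra cleaning step is sketched below. It assumes the raw reviews contain the two-character sequences `\n` and `\"` literally, as the output above suggests (see the token `'city.\\n\\nThey'`). `clean_escapes` is a hypothetical helper, not part of the notebook's pipeline.

```python
import re

def clean_escapes(text: str) -> str:
    """Turn literal backslash-n and backslash-quote sequences into plain text."""
    text = text.replace("\\n", " ")    # two characters: backslash + n
    text = text.replace('\\"', '"')    # two characters: backslash + double quote
    return re.sub(r"\s+", " ", text).strip()

# made-up fragment in the same style as the review above
raw = 'south of the city.\\n\\nThey serve \\"The Mixed Grill\\"'
print(clean_escapes(raw))
```

A step like this could be called at the start of `tokenize_doc`, before the whitespace normalization, so that tokens such as `city.` and `They` are split correctly.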

In [None]:
# build the vocab
vocab = build_vocab_from_iterator(
    map(tokenize_doc, texts_train), specials=['<unk>'], min_freq=2)
vocab.set_default_index(vocab['<unk>']) # index returned when we look up an OOV token

In [None]:
vocab["<unk>"], vocab["riquelme"], vocab["the"], vocab["area"], vocab["<s>"]

(0, 0, 2, 217, 11)

In [None]:
# let's look at an example:
print(texts_train[33])
x = doc2tensor(texts_train[33], vocab, ngram_order=3)
print(x)
print(x.shape)

If you want a true understanding of Pittsburgh in the morning, come here. This greasy spoon is always packed, and is one of the better of its kind south of the city.\n\nThey serve waffles in halves, which is great. The eggs and toast are good, the homemade hot sausage is excellent. The drawback are the barely cooked potatoes.\n\nIf you're hungry, get \"The Mixed Grill\"... Gab and Eat's brand of the \"kitchen sink\" breakfast that all Midwest places are about.
tensor([   11,    11,   160,    21,   144,     6,   955,  3315,     9,   102,
           14,     2,   576,     3,   162,    47,     1,   103,   683,  2455,
           13,   115,   569,     3,     4,    13,    57,     9,     2,   123,
            9,   288,   286,  2794,     9,     2,     0,   587,  4683,    14,
            0,     3,    65,    13,    74,     1,    22,   789,     4,   834,
           34,    39,     3,     2,  1056,   250,   639,    13,   399,     1,
           22,  4232,    34,     2,   911,   350,     0,    21,   1

In [None]:
# the first 5 training samples from this doc:
doc2ngrams(texts_train[33], vocab)[:5]

[(tensor([11, 11]), tensor(160)),
 (tensor([ 11, 160]), tensor(21)),
 (tensor([160,  21]), tensor(144)),
 (tensor([ 21, 144]), tensor(6)),
 (tensor([144,   6]), tensor(955))]

In [None]:
# build the training ngrams
ngrams_train = []
for doc in texts_train:
  ngrams_train.extend(doc2ngrams(doc, vocab, ngram_order=3))

## Model

We build a very simple network with one hidden layer. It is the same architecture as Figure 7.17 in Jurafsky.

**NOTE:**

* If we use [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) we must not apply softmax ourselves, because it expects "raw, unnormalized scores for each class".
* In contrast, [NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) expects log-probabilities, i.e. the output of log_softmax.
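The relation between the two losses can be checked by hand for a single sample. The sketch below uses plain Python and made-up scores for a 3-word vocab. It does not call PyTorch; it only reproduces the arithmetic: cross entropy on raw scores equals negative log-likelihood on log-softmax outputs.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z))) over a list of raw scores."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def log_softmax(logits):
    """Log-probabilities from raw scores."""
    lse = log_sum_exp(logits)
    return [z - lse for z in logits]

def nll(log_probas, target):
    """Negative log-likelihood: expects log-probabilities as input."""
    return -log_probas[target]

def cross_entropy(logits, target):
    """Cross entropy: expects raw, unnormalized scores as input."""
    return log_sum_exp(logits) - logits[target]

logits = [2.0, 0.5, -1.0]  # made-up raw scores
target = 0                 # index of the correct word

loss_nll = nll(log_softmax(logits), target)
loss_ce = cross_entropy(logits, target)
print(loss_nll, loss_ce)
```

Applying log_softmax twice (once in the model, once inside a cross-entropy loss) would not raise an error, which is why it is worth deciding explicitly where the normalization happens.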

In [None]:
class NGramLanguageModel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_size, ngram_order):
        super().__init__()
        context_size = ngram_order - 1
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs) # shape (bsz, context_size, embed_dim)
        # concatenate the context vectors
        embeds = embeds.flatten(1) # shape (bsz, context_size * embed_dim)
        hidden = F.relu(self.linear1(embeds))
        z = self.linear2(hidden)
        log_probas = F.log_softmax(z, dim=1)
        return log_probas

# # Equivalent alternative:
# from collections import OrderedDict

# class NGramLanguageModel(nn.Module):

#     def __init__(self, vocab_size, embedding_dim, hidden_size, ngram_order):
#         super().__init__()
#         context_size = ngram_order - 1
#         self.model = nn.Sequential(OrderedDict([
#             ('embeddings', nn.Embedding(vocab_size, embedding_dim)),
#             ('flatten', nn.Flatten(1)),
#             ('linear1', nn.Linear(context_size * embedding_dim, hidden_size)),
#             ('relu', nn.ReLU()),
#             ('linear2', nn.Linear(hidden_size, vocab_size)),
#             ('log_softmax', nn.LogSoftmax(dim=1))
#         ]))
#     def forward(self, inputs):
#         return self.model(inputs)
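To get a feel for where the parameters of this architecture live, here is a rough count, assuming an embedding table of `vocab_size * embedding_dim` weights and linear layers with a bias term. The hyperparameters are the ones used further down in this notebook; `vocab_size=15_000` is a round number chosen only for illustration, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim, hidden_size, ngram_order):
    """Count trainable parameters per layer for the n-gram LM above."""
    context_size = ngram_order - 1
    embeddings = vocab_size * embedding_dim
    # linear layer: in_features * out_features weights + out_features biases
    linear1 = context_size * embedding_dim * hidden_size + hidden_size
    linear2 = hidden_size * vocab_size + vocab_size
    return {
        "embeddings": embeddings,
        "linear1": linear1,
        "linear2": linear2,
        "total": embeddings + linear1 + linear2,
    }

sizes = count_parameters(
    vocab_size=15_000,  # assumption, not the real len(vocab)
    embedding_dim=50,
    hidden_size=16,
    ngram_order=3,
)
for name, n in sizes.items():
    print(f"{name:>10}: {n:,}")
```

Under these assumptions almost all parameters sit in the embedding table and the output layer, both of which scale with the vocab size; the hidden layer is tiny in comparison.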

## Training

We wrap our training ngrams in a `DataLoader`. This class lets us process the samples in batches during training.

In [None]:
# seed for reproducibility (https://pytorch.org/docs/stable/notes/randomness.html#dataloader)
g = torch.Generator()
g.manual_seed(33)

<torch._C.Generator at 0x7a012cba7d90>

In [None]:
train_dataloader = DataLoader(
    ngrams_train, batch_size=32, shuffle=True, generator=g)

In [None]:
# inspect the contexts of the first batch from the loader
batch_example = next(iter(train_dataloader))[0]
print(batch_example)
batch_example.shape
# batch_size examples with 2 IDs each (the 2 context tokens)

tensor([[    1,    22],
        [ 2124,    17],
        [    9,   328],
        [ 1177,    59],
        [  703,     3],
        [    0,    63],
        [  123,     1],
        [  161,     3],
        [  422,    16],
        [  112,     8],
        [   11,    11],
        [  101,    88],
        [    2,  5681],
        [   98,   985],
        [    4,     0],
        [    1,     5],
        [   11,    11],
        [  887,     8],
        [   99, 13755],
        [ 2690,     4],
        [  364,  2406],
        [ 2604,    51],
        [   16,     5],
        [  213,     2],
        [   40,     1],
        [  549,     1],
        [ 1208,     5],
        [    0,    95],
        [   13,     6],
        [   38,   110],
        [  153,     1],
        [    4,   598]])


torch.Size([32, 2])
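What the loader does with the list of `(context, target)` pairs can be approximated in plain Python. This is a simplified sketch, not the `DataLoader` implementation: it works on lists instead of tensors, `iter_batches` is a hypothetical name, and the samples are made up.

```python
import random

def iter_batches(samples, batch_size, shuffle=True, seed=33):
    """Yield (contexts, targets) batches from a list of (context, target) pairs."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # seeded, so the order is reproducible
    for start in range(0, len(order), batch_size):
        chunk = [samples[i] for i in order[start:start + batch_size]]
        contexts = [ctx for ctx, _ in chunk]  # batch of context pairs
        targets = [tgt for _, tgt in chunk]   # batch of target ids
        yield contexts, targets

# made-up samples: 10 trigrams with consecutive ids
samples = [((i, i + 1), i + 2) for i in range(10)]
batches = list(iter_batches(samples, batch_size=4))
print([len(targets) for _, targets in batches])
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size, which is why the training loop averages the loss over batches rather than assuming a fixed size.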

In [None]:
def train(
    loss_function, optimizer, model, train_dataloader, num_epochs, device=None):
  """Train by iterating over epochs.
  """
  for epoch in range(num_epochs):
      epoch_loss = train_epoch(loss_function, optimizer, model, train_dataloader, device=device)
      print(f"Epoch {epoch+1} / Loss {epoch_loss:.3f}")

def train_epoch(loss_function, optimizer, model, ngrams_loader, device=None):
    """Train for 1 epoch and return the average loss per batch
    """
    total_loss = 0
    num_batches = 0
    for context, target in ngrams_loader:
        if device:
            context = context.to(device)
            target = target.to(device)
        # 1. Reset the gradients to zero
        optimizer.zero_grad()
        # 2. Forward pass (log probabilities over next words)
        log_probas = model(context)
        # 3. Compute loss function
        loss = loss_function(log_probas, target)
        # 4. Backward pass (computes gradients)
        loss.backward()
        # 5. Update weights
        optimizer.step()
        # Accumulate the batch loss
        total_loss += loss.item()
        num_batches += 1
    return total_loss / num_batches

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
# GPU can be enabled in the notebook settings before running

cuda


In [None]:
model = NGramLanguageModel(
    vocab_size=len(vocab),
    embedding_dim=50,
    hidden_size=16,
    ngram_order=3,
)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_function = nn.NLLLoss() # on log_softmax outputs this is equivalent to cross entropy

model = model.to(device)

In [None]:
# train! (1 epoch to save time)
train_dataloader = DataLoader(ngrams_train, batch_size=32, shuffle=True, generator=g)
num_epochs = 1
train(loss_function, optimizer, model, train_dataloader, num_epochs=num_epochs, device=device)

Epoch 1 / Loss 5.953
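To interpret the reported loss: since it is an average negative log-likelihood, `exp(-loss)` is the geometric-mean probability the model gave to the correct next token. The sketch below compares it with a uniform guess. The vocab size is an assumption chosen for illustration, and the epoch average mixes batches seen at different stages of training, so this is only a rough reading.

```python
import math

train_loss = 5.953   # average NLL per token reported above
vocab_size = 15_000  # assumption, not the real len(vocab)

# geometric-mean probability assigned to the correct token
avg_proba = math.exp(-train_loss)

# a model that ignores the context and guesses uniformly
uniform_proba = 1 / vocab_size
uniform_loss = math.log(vocab_size)

print(f"model:   loss={train_loss:.3f}  proba={avg_proba:.5f}")
print(f"uniform: loss={uniform_loss:.3f}  proba={uniform_proba:.5f}")
```

Under this assumption the model is already well below the uniform loss after one epoch, although the absolute probability per token is still small.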


## Evaluation

We compute perplexity and see how to generate random text.

In [None]:
def text2input(context_str: str, vocab: Vocab, ngram_order: int = 3) -> Tensor:
    """Convert a context string into an input for the NN (tensor of context IDs)
    """
    ngrams = doc2ngrams(context_str, vocab, ngram_order=ngram_order)
    # the input is the context of the last ngram
    last_context = ngrams[-1][0]
    # add a leading dimension that acts as the batch (size=1) for the forward pass
    out = last_context.unsqueeze(0)
    return out

def idx2str(itos: list, idxs: Tensor) -> list:
    """From vocab IDs to tokens
    """
    res = [itos[i] for i in idxs]
    return res

def sample_text(model, vocab, start_text, max_length=10, ngram_order=3):
    """Random autoregressive text generation, sampling from the softmax.
    The model must be consistent with ngram_order.
    """
    # get the input IDs according to the context size
    input_ = text2input(start_text, vocab, ngram_order=ngram_order)
    # get the model device to send inputs to the same device
    device = next(model.parameters()).device
    input_ = input_.to(device)
    idx_eos = vocab.get_stoi()["</s>"]
    context_size = ngram_order - 1
    itos = vocab.get_itos()
    # the result only includes the context actually used by the ngrams + the new text
    idxs_result = input_.clone()
    with torch.no_grad():  # no need to track gradients in inference
        for i in range(max_length):
            output_ = model(input_) # log_softmax scores
            # output is < 0, so we have to apply exp to sample from the
            # softmax with torch.multinomial
            sampled_idx = torch.multinomial(output_.exp(), num_samples=1)
            if sampled_idx == idx_eos: # break if </s>
                break
            # update the result
            idxs_result = torch.cat((idxs_result, sampled_idx), dim=1)
            # update the input keeping only the last context_size tokens
            input_ = idxs_result[:,-context_size:]
        tokens_result = idx2str(itos, idxs_result.squeeze())
        return tokens_result
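The sampling step inside `sample_text` can be illustrated in isolation. The sketch below uses plain Python instead of `torch.multinomial`, with a made-up distribution over 3 tokens; `sample_from_log_probas` is a hypothetical name. It exponentiates the log-probabilities and draws an index in proportion to them.

```python
import math
import random

def sample_from_log_probas(log_probas, rng):
    """Draw one index i with probability exp(log_probas[i])."""
    probas = [math.exp(lp) for lp in log_probas]  # back from log space
    return rng.choices(range(len(probas)), weights=probas, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
log_probas = [math.log(0.7), math.log(0.2), math.log(0.1)]  # made-up distribution

draws = [sample_from_log_probas(log_probas, rng) for _ in range(10_000)]
freqs = [draws.count(i) / len(draws) for i in range(3)]
print(freqs)
```

With enough draws the empirical frequencies approach the underlying probabilities. This is why sampled text varies between runs, unlike always picking the most likely token.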

In [None]:
torch.manual_seed(0)
start_text = "The place is"
res_ = sample_text(model, vocab, start_text, max_length=10)

print(res_)

['place', 'is', 'decent', '<unk>', 'A', 'enough', 'and', 'kissed', 'you', 'have', 'no', 'places']


In [None]:
torch.manual_seed(22)
start_text = ""
res_ = sample_text(model, vocab, start_text, max_length=35)

res_

['<s>',
 '<s>',
 'go',
 'overall',
 '.',
 'The',
 'dirty',
 'side',
 'with',
 'my',
 'server',
 'also',
 'called',
 'about',
 'no',
 'biggest',
 'sum',
 'it',
 "'s",
 'best',
 'is',
 'how',
 'a',
 'Friday',
 'and',
 'downright',
 'cheese',
 '.',
 'return',
 'to',
 'the',
 'cool',
 'and',
 'desert',
 'a',
 'hungry',
 'way']

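Now we compute perplexity (PPL).

We compute $\exp(\log \mathrm{PPL})$ to avoid underflow.

Note that $\log \mathrm{PPL} = \mathrm{CrossEntropy} = -\frac{1}{N}\sum_{i=1}^{N} \log p_i$, where $p_i$ is the probability the model assigns to the correct token of sample $i$.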
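A small numeric example of why we work in log space, using made-up probabilities: multiplying many small numbers underflows to zero in floating point, while averaging their logs stays well behaved. `perplexity_from_log_probas` is a hypothetical helper for illustration.

```python
import math

def perplexity_from_log_probas(log_probas):
    """PPL = exp(-mean(log p)), computed entirely in log space."""
    return math.exp(-sum(log_probas) / len(log_probas))

# made-up case: 200 tokens, each predicted with probability 0.01
probas = [0.01] * 200
log_probas = [math.log(p) for p in probas]

# direct product of the probabilities: far below the float range
naive_product = math.prod(probas)

ppl = perplexity_from_log_probas(log_probas)
print(naive_product, ppl)
```

Here the product collapses to `0.0`, so a perplexity computed from it would be undefined. The log-space version recovers a perplexity of about 100, the inverse of the per-token probability.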

In [None]:
# test ngrams (we only use the first doc)
ngrams_test = doc2ngrams(texts_test[0], vocab, ngram_order=3)

In [None]:
test_dataloader = DataLoader(ngrams_test, batch_size=32, shuffle=False)

In [None]:
def perplexity(model, dataloader, device):
    with torch.no_grad():
        # Iterate by batch, accumulating the log-probas of the correct tokens in each batch.
        all_log_probas_correct = torch.tensor([], device=device)
        for context, target in dataloader:
            if device:
                context = context.to(device)
                target = target.to(device)
            batch_size = len(target)
            log_probas = model(context) # shape (bsz, vocab_size)
            log_probas_correct = log_probas[torch.arange(batch_size), target] # log-proba of the correct token
            all_log_probas_correct = torch.cat((all_log_probas_correct, log_probas_correct))
            # NOTE we could also use the loss, which equals mean(-log(proba_correct_class)):
            # loss = loss_function(log_probas, target) # this is the batch average
            # which is equivalent to:
            # loss2 = torch.mean(-log_probas_correct)
        res = torch.exp(-all_log_probas_correct.mean())
    return res.item()

In [None]:
perplexity(model, test_dataloader, device)

261.5930480957031

In [None]:
# perplexity is only meaningful if the model outputs a proper probability distribution given the context

## References

* https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#an-example-n-gram-language-modeling
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
* https://pytorch.org/docs/stable/notes/autograd.html