## Language Modeling with n-grams

We will use `nltk` for the model and Hugging Face `datasets` for the corpus.

In [1]:
%%capture
!pip install datasets==2.11.0

In [2]:
%%capture
!python -m spacy download en_core_web_sm # for tokenization

In [3]:
import re

import numpy as np
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Vocabulary, Lidstone
from datasets import load_dataset
from torchtext.data.utils import get_tokenizer

## Data

We will use a corpus of Yelp reviews, for illustration only. Each document, with all its attributes (text, tags, etc.), is an "example" or "row".

Read the [very short HF tutorial on `datasets`](https://huggingface.co/docs/datasets/tutorial) to get started with them.

In [4]:
dataset = load_dataset("yelp_review_full")

In [5]:
# inspect the structure:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})


In [6]:
# look at one review (index 33, picked arbitrarily):
dataset["train"][33]

{'label': 2,
 'text': 'If you want a true understanding of Pittsburgh in the morning, come here. This greasy spoon is always packed, and is one of the better of its kind south of the city.\\n\\nThey serve waffles in halves, which is great. The eggs and toast are good, the homemade hot sausage is excellent. The drawback are the barely cooked potatoes.\\n\\nIf you\'re hungry, get \\"The Mixed Grill\\"... Gab and Eat\'s brand of the \\"kitchen sink\\" breakfast that all Midwest places are about.'}

In [7]:
# shrink it to work faster: 5k train, 5k test
dataset["train"] = dataset["train"].select(range(0, 5_000))
dataset["test"] = dataset["test"].select(range(0, 5_000))

## Tokenization

`nltk` expects each document to be a list of strings, so we first have to tokenize the documents.

For now we will use the `spacy` English tokenizer (instantiated through `torchtext`); in upcoming classes we will use more sophisticated ones.

In [8]:
# default English tokenizer, with rules for punctuation, contractions, etc.:
tokenizer = get_tokenizer('spacy')



In [9]:
# let's look at an example
texto_ejemplo = "But I don't want nothing at all... if it ain't you, baby"
resultado_ejemplo = tokenizer(texto_ejemplo)
print(type(resultado_ejemplo))
print(resultado_ejemplo)

<class 'list'>
['But', 'I', 'do', "n't", 'want', 'nothing', 'at', 'all', '...', 'if', 'it', 'ai', "n't", 'you', ',', 'baby']


In [10]:
def tokenize_example(example):
  """Function to map over the dataset. Tokenizes the text of each example.
  It must return a dict so that `map` can add the tokens as a new column
  of the dataset.
  """
  # very simple cleanup: collapse every whitespace run into a single space
  # (literal "\n" sequences in the raw text are not whitespace, so they survive)
  text = re.sub(r'\s+', ' ', example["text"])
  tokens = tokenizer(text)
  # return a dict; `map` returns a new dataset with its keys as columns
  return {"tokens": tokens}
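The raw Yelp text shown earlier contains literal `\n` sequences (a backslash followed by `n`), which `\s+` does not match. The following is a minimal sketch of that behavior; the sample string and the extra `replace` step are assumptions for illustration and are not part of the notebook's pipeline.

```python
import re

# A toy review in the same style as the corpus: "\\n" here is a backslash
# followed by the letter n, not a newline character.
raw = "great waffles.\\n\\nThe eggs are good.\tReally   good."

# Whitespace-only cleanup, as in tokenize_example: the literal "\n" survives.
collapsed = re.sub(r"\s+", " ", raw)

# Hypothetical extra step: turn the literal "\n" into a space first.
cleaned = re.sub(r"\s+", " ", raw.replace("\\n", " "))

print(collapsed)
print(cleaned)
```

Applying a step like this would change the token counts reported in the rest of the notebook, so it is shown only as a possible improvement.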

In [11]:
# later on we will work with batches to speed up processing
dataset = dataset.map(tokenize_example, batched=False)

In [12]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'tokens'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['label', 'text', 'tokens'],
        num_rows: 5000
    })
})
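The `map` call above processes one example at a time. As a rough sketch of the batched variant mentioned in the comment, the function below receives a dict of lists, which is the layout `Dataset.map(..., batched=True)` passes to it. The name `tokenize_batch` and the whitespace `str.split` stand-in tokenizer are assumptions so that the sketch runs without spaCy.

```python
import re

def tokenize_batch(batch, tokenize=str.split):
    """Batched counterpart of tokenize_example (hypothetical helper).

    `batch` maps each column name to a list of values. `tokenize` defaults
    to a plain whitespace split; in the notebook you would pass the spaCy
    `tokenizer` instead.
    """
    texts = [re.sub(r"\s+", " ", text) for text in batch["text"]]
    return {"tokens": [tokenize(text) for text in texts]}

# Toy batch with the same columns as the Yelp dataset.
toy_batch = {"label": [4, 0], "text": ["Great  waffles here", "Never again"]}
print(tokenize_batch(toy_batch))

# Assumed usage in the notebook (not run here):
# dataset = dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)
```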


In [13]:
# let's look at an example
print(dataset["train"]["tokens"][33][:10]) # first 10 tokens

['If', 'you', 'want', 'a', 'true', 'understanding', 'of', 'Pittsburgh', 'in', 'the']


In [14]:
# for LM with nltk we only need to keep the tokens (as a list of lists of tokens)
tokenized_train = dataset["train"]["tokens"]
tokenized_test = dataset["test"]["tokens"]

In [15]:
print(tokenized_train[33][:10])

['If', 'you', 'want', 'a', 'true', 'understanding', 'of', 'Pittsburgh', 'in', 'the']


## Model

We will use 3-grams, so each document needs padding with 2 BOS and 2 EOS tokens (n - 1 on each side).

The simplest n-gram LM is the MLE (Maximum Likelihood Estimator), which estimates each conditional probability as a relative frequency of counts.
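For a trigram model, the MLE estimate is `P(w | u, v) = count(u, v, w) / count(u, v)`. The snippet below is a minimal plain-Python sketch of that idea on a toy corpus; the helper names `train_mle_trigrams` and `mle_score` are made up for illustration and this is not how `nltk` implements it.

```python
from collections import Counter, defaultdict

def train_mle_trigrams(docs):
    """Count trigram continuations: counts[(u, v)][w] = count(u, v, w)."""
    counts = defaultdict(Counter)
    for doc in docs:
        padded = ["<s>", "<s>"] + list(doc) + ["</s>", "</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            counts[(u, v)][w] += 1
    return counts

def mle_score(counts, word, context):
    """Relative frequency of `word` after `context`; 0.0 for unseen contexts."""
    continuations = counts.get(tuple(context))
    if not continuations:
        return 0.0
    return continuations[word] / sum(continuations.values())

toy_docs = [["in", "the", "area"], ["in", "the", "city"], ["in", "the", "area"]]
toy_counts = train_mle_trigrams(toy_docs)
print(mle_score(toy_counts, "area", ["in", "the"]))  # 2 of 3 continuations
print(mle_score(toy_counts, "mail", ["in", "the"]))  # never seen after "in the"
```

Any trigram that never appears in training gets probability zero, which matters later when evaluating on unseen text.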

In [16]:
# nltk uses lazy iterators for train and vocab to avoid materializing all docs in memory
# they are evaluated on demand during training
train, vocab_ = padded_everygram_pipeline(3, tokenized_train)

In [17]:
print(padded_everygram_pipeline.__doc__)

Default preprocessing for a sequence of sentences.

    Creates two iterators:

    - sentences padded and turned into sequences of `nltk.util.everygrams`
    - sentences padded as above and chained together for a flat stream of words

    :param order: Largest ngram length produced by `everygrams`.
    :param text: Text to iterate over. Expected to be an iterable of sentences.
    :type text: Iterable[Iterable[str]]
    :return: iterator over text as ngrams, iterator over text as vocabulary data
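To make the docstring concrete, here is a plain-Python sketch of what "padded and turned into everygrams" means for one short document. The functions `pad` and `everygrams` below are simplified stand-ins written for this example; `nltk`'s own implementation is lazy and may yield the n-grams in a different order.

```python
def pad(tokens, n, bos="<s>", eos="</s>"):
    """Add n - 1 boundary symbols on each side, as an order-n model needs."""
    return [bos] * (n - 1) + list(tokens) + [eos] * (n - 1)

def everygrams(tokens, max_len):
    """All n-grams of length 1..max_len, as tuples."""
    return [
        tuple(tokens[i:i + size])
        for size in range(1, max_len + 1)
        for i in range(len(tokens) - size + 1)
    ]

padded = pad(["Got", "a", "letter"], n=3)
grams = everygrams(padded, max_len=3)
trigrams = [g for g in grams if len(g) == 3]

print(padded)
print(trigrams)
```

Because unigrams, bigrams and trigrams are all produced, a single pass over the data gives the model the counts for every order up to 3.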
    


In [18]:
# frequency cutoff for the vocab: tokens with count < 2 map to <UNK>
vocab = Vocabulary(vocab_, unk_cutoff=2)

In [19]:
# the least and most frequent tokens:
print(sorted(vocab.counts, key=vocab.counts.get)[:5])
print(sorted(vocab.counts, key=vocab.counts.get, reverse=True)[:5])

['goldberg', 'practitioner', 'nyu', 'referrals', 'drawing']
['.', 'the', ',', 'and', 'I']


In [20]:
# tokens in alphabetical order:
print(sorted(vocab.counts)[:5])
print(sorted(vocab.counts, reverse=True)[:5])
# we could improve the preprocessing!

['!', '"', '#', '$', '$20']
['~75', '~6', '~40%+', '~35', '~24\\']


In [21]:
# tokens with frequency 1 are "not in the vocab" (but we can still query their count)
print(vocab["goldberg"], "goldberg" in vocab)
print(vocab[" "], " " in vocab)
print(vocab["boquita"], "boquita" in vocab)
print(vocab["the"], "the" in vocab)

1 False
0 False
0 False
29137 True


In [22]:
# example of a tokenized sequence after vocab lookup:
print(vocab.lookup(tokenized_train[3][:10]))

('Got', 'a', 'letter', 'in', 'the', 'mail', 'last', 'week', 'that', 'said')


In [23]:
# another example: out-of-vocab tokens map to <UNK>
print(vocab.lookup(["the", "goldberg", "boquita", "."]))

('the', '<UNK>', '<UNK>', '.')
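The cutoff and lookup behavior seen in the last few cells can be sketched with a `Counter`. This is only an illustration of the idea under the same rule (keep tokens with count >= cutoff, map the rest to `<UNK>`); `build_vocab` and `lookup` are hypothetical helpers, not the `nltk.lm.Vocabulary` API.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(token_stream, unk_cutoff=2):
    """Return raw counts plus the set of tokens that meet the cutoff."""
    counts = Counter(token_stream)
    kept = {tok for tok, c in counts.items() if c >= unk_cutoff}
    return counts, kept

def lookup(tokens, kept):
    """Replace every token outside the kept set with the UNK label."""
    return tuple(tok if tok in kept else UNK for tok in tokens)

stream = ["the", "food", "the", "goldberg", "food", "the", "."]
counts, kept = build_vocab(stream)

print(counts["goldberg"], "goldberg" in kept)  # counted, but below the cutoff
print(counts["boquita"], "boquita" in kept)    # never seen
print(lookup(["the", "goldberg", "boquita"], kept))
```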


In [24]:
# number of entries left in the vocab after the cutoff
len(vocab)

14238

In [25]:
# instantiate the model with the highest ngram order
lm = MLE(3, vocabulary=vocab)

In [26]:
%%time
lm.fit(train)

CPU times: user 18.7 s, sys: 610 ms, total: 19.3 s
Wall time: 20.4 s


In [27]:
print(lm.vocab)

<Vocabulary with cutoff=2 unk_label='<UNK>' and 14238 items>


In [28]:
print(lm.counts)

<NgramCounter with 3 ngram orders and 2375001 ngrams>


In [29]:
# unigram counts
lm.counts['the']

29137

In [30]:
# bigram counts
print(lm.counts[['in']]["the"])
print(lm.counts[['the']]["in"])
print(lm.counts[['the']]["<UNK>"])

2129
1
1268


In [31]:
# each doc is padded with 2 BOS and 2 EOS tokens
print(lm.counts[["<s>"]]["<s>"])
print(lm.counts[["</s>"]]["</s>"])

5000
5000


In [32]:
# ("<s>", "<s>", 'Got', 'a', 'letter', 'in', 'the', 'mail', '.', "</s>", "</s>")

In [33]:
# trigram counts
print(lm.counts[["in", "the"]]["mail"])
print(lm.counts[["the", "simple"]]["fact"])

3
0


In [34]:
# most frequent continuations of a bigram:
bigram_example = ["in", "the"]
sorted(lm.counts[bigram_example].items(), key=lambda x: x[1], reverse=True)[:10]

[('area', 113),
 ('<UNK>', 90),
 ('back', 76),
 ('city', 70),
 ('middle', 63),
 ('past', 47),
 ('Strip', 46),
 ('restaurant', 38),
 ('mood', 35),
 ('morning', 31)]

In [35]:
# probability of a token given a bigram context:
print(lm.score("area", bigram_example))

0.05307656176608737
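This score can be checked by hand from counts printed earlier: "area" follows "in the" 113 times, and the bigram "in the" occurs 2129 times. The sketch below assumes that the number of times the context occurs equals the bigram count, which should hold here because padding guarantees every bigram is followed by some token.

```python
import math

count_in_the = 2129       # lm.counts[['in']]['the'] from the bigram cell
count_in_the_area = 113   # top continuation of the context ["in", "the"]

# MLE estimate: relative frequency of the continuation.
p_area = count_in_the_area / count_in_the

print(p_area)             # compare with lm.score("area", ["in", "the"])
print(math.log2(p_area))  # compare with lm.logscore("area", ["in", "the"])
```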


In [36]:
# use logscore (log base 2) to avoid underflow
print(lm.logscore("area", bigram_example))
print(np.log2(lm.score("area", bigram_example)))

-4.235781272037106
-4.235781272037106
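The underflow problem is easy to reproduce: multiplying a few hundred small probabilities yields exactly `0.0` in floating point, while the sum of their logs stays finite. The values of `p` and `n` below are arbitrary, chosen only for illustration.

```python
import math

p = 0.05   # a per-token probability of the same magnitude as the one above
n = 500    # number of tokens in a longish document

# Multiplying probabilities directly: the result falls below the smallest
# representable float and collapses to zero.
product = 1.0
for _ in range(n):
    product *= p

# Adding log-probabilities keeps the same information in a safe range.
log_sum = n * math.log2(p)

print(product)
print(log_sum)
```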


## Evaluation

In [37]:
example_test = tokenized_test[0]
print(example_test)

['I', 'got', "'", 'new', "'", 'tires', 'from', 'them', 'and', 'within', 'two', 'weeks', 'got', 'a', 'flat', '.', 'I', 'took', 'my', 'car', 'to', 'a', 'local', 'mechanic', 'to', 'see', 'if', 'i', 'could', 'get', 'the', 'hole', 'patched', ',', 'but', 'they', 'said', 'the', 'reason', 'I', 'had', 'a', 'flat', 'was', 'because', 'the', 'previous', 'patch', 'had', 'blown', '-', 'WAIT', ',', 'WHAT', '?', 'I', 'just', 'got', 'the', 'tire', 'and', 'never', 'needed', 'to', 'have', 'it', 'patched', '?', 'This', 'was', 'supposed', 'to', 'be', 'a', 'new', 'tire', '.', '\\nI', 'took', 'the', 'tire', 'over', 'to', 'Flynn', "'s", 'and', 'they', 'told', 'me', 'that', 'someone', 'punctured', 'my', 'tire', ',', 'then', 'tried', 'to', 'patch', 'it', '.', 'So', 'there', 'are', 'resentful', 'tire', 'slashers', '?', 'I', 'find', 'that', 'very', 'unlikely', '.', 'After', 'arguing', 'with', 'the', 'guy', 'and', 'telling', 'him', 'that', 'his', 'logic', 'was', 'far', 'fetched', 'he', 'said', 'he', "'d", 'give', 

In [38]:
print(lm.vocab.lookup(example_test))

('I', 'got', "'", 'new', "'", 'tires', 'from', 'them', 'and', 'within', 'two', 'weeks', 'got', 'a', 'flat', '.', 'I', 'took', 'my', 'car', 'to', 'a', 'local', 'mechanic', 'to', 'see', 'if', 'i', 'could', 'get', 'the', 'hole', '<UNK>', ',', 'but', 'they', 'said', 'the', 'reason', 'I', 'had', 'a', 'flat', 'was', 'because', 'the', 'previous', 'patch', 'had', 'blown', '-', 'WAIT', ',', 'WHAT', '?', 'I', 'just', 'got', 'the', 'tire', 'and', 'never', 'needed', 'to', 'have', 'it', '<UNK>', '?', 'This', 'was', 'supposed', 'to', 'be', 'a', 'new', 'tire', '.', '\\nI', 'took', 'the', 'tire', 'over', 'to', '<UNK>', "'s", 'and', 'they', 'told', 'me', 'that', 'someone', '<UNK>', 'my', 'tire', ',', 'then', 'tried', 'to', 'patch', 'it', '.', 'So', 'there', 'are', 'resentful', 'tire', '<UNK>', '?', 'I', 'find', 'that', 'very', 'unlikely', '.', 'After', 'arguing', 'with', 'the', 'guy', 'and', 'telling', 'him', 'that', 'his', 'logic', 'was', 'far', '<UNK>', 'he', 'said', 'he', "'d", 'give', 'me', 'a', 'n

In [39]:
def perplexity(tokens, lm, ngram_order=3) -> float:
    """We have to build the padded ngrams "by hand" on test, making sure we
    apply the same criterion as in train.
    NOTE to evaluate perplexity over many docs we should build a single list
    with the ngrams of all the docs
    """
    ngrams_padded = ngrams(
        tokens, ngram_order, pad_right=True, pad_left=True, left_pad_symbol="<s>",
        right_pad_symbol="</s>")
    return lm.perplexity(list(ngrams_padded))
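The NOTE in the docstring suggests pooling the n-grams of all documents before computing perplexity. One possible way to build that list is sketched below in plain Python; `padded_ngrams` and `corpus_ngrams` are hypothetical helpers that follow the same padding rule (n - 1 symbols on each side).

```python
from itertools import chain

def padded_ngrams(tokens, n, bos="<s>", eos="</s>"):
    """n-grams of one document, padded with n - 1 symbols on each side."""
    padded = [bos] * (n - 1) + list(tokens) + [eos] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def corpus_ngrams(docs, n=3):
    """Concatenate the padded n-grams of every document into one list."""
    return list(chain.from_iterable(padded_ngrams(doc, n) for doc in docs))

toy_docs = [["good", "food"], ["bad"]]
all_trigrams = corpus_ngrams(toy_docs)
print(len(all_trigrams))
print(all_trigrams[0], all_trigrams[-1])

# Assumed usage with a fitted nltk model (not run here):
# some_lm.perplexity(corpus_ngrams(tokenized_test))
```

Pooling the n-grams weights each document by its length, which differs from averaging per-document perplexities.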

In [40]:
example_train = tokenized_train[33]
perplexity(example_train, lm)

8.110706034849242

In [41]:
# we need smoothing / backoff / interpolation to compute perplexity on test!
perplexity(example_test, lm)

inf
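The `inf` follows from the definition: perplexity is 2 raised to the average negative log2-probability of the n-grams, so a single n-gram with probability zero makes the average infinite. The function below is a small sketch of that formula on hand-picked probabilities; `perplexity_from_probs` is a made-up name, not part of `nltk`.

```python
import math

def perplexity_from_probs(probs):
    """2 ** (average negative log2-probability) over a list of n-gram probs."""
    log_probs = [math.log2(p) if p > 0 else float("-inf") for p in probs]
    cross_entropy = -sum(log_probs) / len(log_probs)
    return 2 ** cross_entropy

seen_everything = [0.25, 0.25, 0.25]   # every trigram was seen in training
one_unseen = [0.25, 0.0, 0.25]         # one trigram has MLE probability 0

print(perplexity_from_probs(seen_everything))
print(perplexity_from_probs(one_unseen))
```

A training document scores finitely under MLE because all of its trigrams were counted, whereas a test document almost always contains at least one unseen trigram.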

In [43]:
# use add-k smoothing (aka Lidstone smoothing, gamma=k)
train, vocab_ = padded_everygram_pipeline(3, tokenized_train)
vocab = Vocabulary(vocab_, unk_cutoff=2)
lm_smoothed = Lidstone(order=3, vocabulary=vocab, gamma=.01)

In [44]:
%%time
lm_smoothed.fit(train)

CPU times: user 16.9 s, sys: 196 ms, total: 17.1 s
Wall time: 17.3 s


In [45]:
perplexity(example_test, lm_smoothed)

12482.455885280744
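Add-k smoothing replaces the MLE ratio with `(count + gamma) / (context_count + gamma * V)`, where `V` is the vocabulary size, so unseen continuations get a small non-zero probability. The sketch below plugs in numbers printed earlier in the notebook (113, 2129 and a vocabulary of 14238 entries); `lidstone_score` is an illustrative helper and the exact values `lm_smoothed` returns are not reproduced here.

```python
def lidstone_score(count_word, count_context, vocab_size, gamma=0.01):
    """Add-gamma estimate of P(word | context)."""
    return (count_word + gamma) / (count_context + gamma * vocab_size)

V = 14238                                  # len(vocab) reported above
mle = 113 / 2129                           # unsmoothed P(area | in the)
seen = lidstone_score(113, 2129, V)        # smoothed, slightly lower
unseen = lidstone_score(0, 2129, V)        # small but no longer zero

print(mle, seen, unseen)

# Sanity check on a toy vocabulary: the smoothed distribution sums to 1.
toy_counts = [2, 1, 0]
total = sum(lidstone_score(c, sum(toy_counts), len(toy_counts), gamma=0.5)
            for c in toy_counts)
print(total)
```

The perplexity is now finite but still large, which is plausible given how little probability each unseen trigram receives.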

In [46]:
# we can generate text by sampling
tokens_ = lm.generate(30, text_seed=["<s>", "<s>"], random_seed=33)
tokens_
#print(" ".join(tokens_))

['Mt.',
 'Lebanon',
 'for',
 'average',
 'seafood',
 '(',
 'or',
 'at',
 'least',
 'marginally',
 'better',
 '...',
 'because',
 'in',
 'certain',
 'spots',
 'it',
 "'s",
 'just',
 'Pittsburgh',
 ',',
 'offers',
 'more',
 'than',
 'what',
 'was',
 'served',
 '2',
 'slices',
 'of']

In [47]:
tokens_ = lm_smoothed.generate(10, text_seed=["<s>", "<s>"], random_seed=33) 
print(" ".join(tokens_))

Much like the cast is touring , I quickly ordered
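Generation amounts to repeatedly sampling the next token from the distribution conditioned on the last two tokens. The following is a rough sketch of that loop over a dict of trigram counts; it is not `nltk`'s `generate` implementation, and stopping at `</s>` is a choice made for this example.

```python
import random
from collections import Counter

def sample_next(continuations, rng):
    """Draw one token with probability proportional to its count."""
    words = list(continuations)
    weights = [continuations[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(counts, n_tokens, seed=33):
    """Sample up to n_tokens tokens, starting from the BOS context."""
    rng = random.Random(seed)
    context = ("<s>", "<s>")
    output = []
    for _ in range(n_tokens):
        continuations = counts.get(context)
        if not continuations:
            break                      # unseen context: nothing to sample
        word = sample_next(continuations, rng)
        if word == "</s>":
            break                      # end of document
        output.append(word)
        context = (context[1], word)   # slide the two-token window
    return output

# Toy counts with a single continuation per context, so the result is fixed.
toy_counts = {
    ("<s>", "<s>"): Counter({"I": 1}),
    ("<s>", "I"): Counter({"like": 1}),
    ("I", "like"): Counter({"waffles": 1}),
    ("like", "waffles"): Counter({"</s>": 1}),
}
print(" ".join(generate(toy_counts, 10)))
```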


In [None]:
# other alternatives:
# AbsoluteDiscountingInterpolated
# WittenBellInterpolated
# KneserNeyInterpolated
# Katz backoff is no longer implemented in nltk

## References

* https://www.nltk.org/api/nltk.lm.html
* https://www.nltk.org/_modules/nltk/lm/api.html
* https://www.nltk.org/howto/lm.html
* https://www.nltk.org/api/nltk.lm.vocabulary.html