# Text-generation with the GPT architecture

<a href="https://colab.research.google.com/drive/1YN6lkDLGiCD7Xdv0WYJ6al9WiSBWFwGC" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

`Text generation` models are a type of machine learning model that can generate natural language text. These models have a wide range of applications, including `language translation` and `summarization`. There are several different approaches to text generation, including `statistical models`, `rule-based systems`, and, more recently, `neural network-based models`.

Neural language models can be trained on large amounts of input text data and use this information to generate new text that is coherent and reflects the patterns and structures found in the training data. The quality of the generated text can vary depending on the _complexity of the model and the amount and quality of the training data_. Overall, text-generation models have the potential to revolutionize many industries by automating many tasks that involve the production and analysis of text.

One of the most significant recent advances in language modeling was made possible by the invention of the `transformer` architecture. A `transformer` model is a type of neural network architecture first described in the 2017 paper "_[Attention Is All You Need](https://arxiv.org/abs/1706.03762)_" by Vaswani et al. Transformers were originally used for machine translation, but this adaptable architecture is now used in a wide range of fields and problems.

<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="drawing" height="450"/>

[Source](https://machinelearningmastery.com/the-transformer-model/).

Before the `transformer`, most neural language models were based on recurrent neural networks ([`RNNs`](https://en.wikipedia.org/wiki/Recurrent_neural_network)). While effective, `RNNs` can be slow and difficult to parallelize, limiting their ability to scale. The `transformer` architecture was designed to overcome these limitations: it processes long data sequences efficiently and is highly parallelizable.

The `transformer` architecture has enabled the development of powerful language models such as [`BERT`](https://huggingface.co/docs/transformers/model_doc/bert) and [`GPT-3`](https://arxiv.org/abs/2005.14165), which have achieved state-of-the-art results on a wide range of natural language processing tasks.

To learn more about the original `transformer` architecture, go to our [`sequence-to-sequence` machine translation](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/48d415094d30e0e5bc8dde32715bb57428a87d7d/ML-Intro-Course/16_sequence_to_sequence.ipynb) notebook.

In this notebook, we will implement a `decoder-only transformer`. `Decoder-only transformers`, such as the `GPT` (_Generative Pre-trained Transformer_) series, are models that consist only of the `decoder` portion of the original transformer, which is then trained on a causal language modeling task (i.e., given the first $n$ tokens, predict token $n+1$).

To start building our model, we first need _good-quality text_. Good-quality text is important for the training of language models for several reasons. For example, good-quality text is more likely to reflect the real-world patterns and structures of language, which is important for the model to learn. A model trained on poorly written or grammatically incorrect text may struggle to generate correct and coherent output. Also, a model trained on large amounts of high-quality text may be able to learn language patterns more quickly and with fewer resources than a model trained on low-quality text.

For this tutorial, we will create a text dataset using the articles from the [`Stanford Encyclopedia of Philosophy`](https://plato.stanford.edu/). Details on the created text corpus can be found on this [dataset card](https://huggingface.co/datasets/AiresPucrs/stanford-encyclopedia-philosophy), and the full dataset can be downloaded from the Hub. 🤗

## Getting a Text Corpus

Web scraping is the process of extracting information from websites. It can be done with a variety of programming languages and tools, for example Python libraries such as `BeautifulSoup` and `Scrapy`.

Web scraping is useful, but it can be done unethically or even illegally. Many websites have terms of service that prohibit scraping, so review a website's terms of use before scraping it.

If we check the [`robots.txt`](https://plato.stanford.edu/robots.txt) file of the SEP, we see what we are allowed to do, and the following code will only scrape permissible content of the SEP.
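
As a minimal sketch of how such a check can be automated, Python's standard `urllib.robotparser` module can evaluate a URL against a set of `robots.txt` rules. The rules below are a made-up example (not the actual contents of the SEP file), and `example.org`, `build_parser`, and `is_allowed` are placeholder names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (NOT the real SEP robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules that are already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.org/entries/abduction/"))
print(is_allowed(parser, "https://example.org/search/searcher.py"))
```

To check a live site instead, the same parser can be pointed at the real file with `parser.set_url(...)` followed by `parser.read()`.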

In this example, we will use `BeautifulSoup` to get our text data.

In [None]:
!pip install keras-nlp tensorflow==2.11 tensorflow-text==2.11 --upgrade -q

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Set the URL to scrape
url = "https://plato.stanford.edu/contents.html"

# Send a request to the URL and retrieve the webpage
page = requests.get(url)

# Parse the webpage with BeautifulSoup
soup = BeautifulSoup(page.text, "html.parser")

# parse the href addresses and anchors of the html page
definitions = soup.find_all("a")
quoted = re.compile('"[^"]*"')

entries = []
for definition in definitions:
    definition = str(definition)

    # get all the addresses of the links in the 'contents.html' page
    for value in quoted.findall(definition):

        # get all pages, that have philosophical text
        if value[1:-1].startswith("entries"):
            entries.append(value[1:-1])

# list of all the texts
paragraphs = []

# list of all the source pages
source = []


Now that we have all the entries, we can begin to extract the text from them (note that the entries are allowed according to the [`robots.txt`](https://plato.stanford.edu/robots.txt)).

⚠️ ALWAYS BE SURE YOU ARE ALLOWED TO SCRAPE. ⚠️

In [None]:
print(f'Number of content pages: {len(entries)}')
# loop over all the pages that have philosophical text
for i, entry in enumerate(entries):
    url = f"https://plato.stanford.edu/{entry}"

    # Send a request to the URL and retrieve the webpage
    page = requests.get(url)

    # Parse the webpage with BeautifulSoup
    soup = BeautifulSoup(page.text, "html.parser")

    # Get all the <p> tags from the parsed html
    texts = soup.find_all("p")

    # Loop over all the <p> tags from the parsed html
    for text in texts:

        # remove the html tags from the string
        clean_string = re.sub(r'<[^>]*>', '', str(text))

        # replace the '\n' with " "
        clean_string = clean_string.replace("\n", " ")

        # append the source and text elements
        paragraphs.append(clean_string)
        source.append(url)

    print(f"Page {i + 1}. Scraped page '{url}'!")

Now, let us create a dataframe with our whole text corpus. First, we will create a "SEP" folder in the current working directory (`/content` on Colab) to store our future files.

In [None]:
import os

# exist_ok=True avoids an error when the cell is run more than once
os.makedirs('./SEP', exist_ok=True)

# create a pandas data frame with the data
df = pd.DataFrame({'text': paragraphs, 'metadata': source})

# drop duplicate text
df = df.drop_duplicates()

# Clean the URL to get the "Category" of the page
def clean_url(string):
    return string.split('entries/')[1][:-1]

# Apply the function to the "metadata" column (the source URL)
df['category'] = df['metadata'].apply(clean_url)

# save as a gzip-compressed parquet file
df.to_parquet('SEP/stanford-encyclopedia-philosophy.parquet', compression='gzip')

And here is our text corpus. You can download it directly from the Hub using the following commands:

```python
from datasets import load_dataset

dataset = load_dataset("AiresPucrs/stanford-encyclopedia-philosophy")
```

In [None]:
import pandas as pd

"""
!pip install datasets -q
from datasets import load_dataset

dataset = load_dataset("AiresPucrs/stanford-encyclopedia-philosophy", split='train')
df = dataset.to_pandas()
"""

df = pd.read_parquet('SEP/stanford-encyclopedia-philosophy.parquet')
display(df)

| | text | metadata | category |
|---|---|---|---|
| 0 | In the philosophical literature, the term “ab... | https://plato.stanford.edu/entries/abduction/ | abduction |
| 1 | This entry is exclusively concerned with abdu... | https://plato.stanford.edu/entries/abduction/ | abduction |
| 2 | See also the entry on scientific discovery, ... | https://plato.stanford.edu/entries/abduction/ | abduction |
| 3 | Most philosophers agree that abduction (in th... | https://plato.stanford.edu/entries/abduction/ | abduction |
| 4 | You happen to know that Tim and Harry have re... | https://plato.stanford.edu/entries/abduction/ | abduction |
| ... | ... | ... | ... |
| 256791 | Many thanks to David Chalmers and to Bill Fis... | https://plato.stanford.edu/entries/zombies/ | zombies |
| 256792 | Copyright © 2023 by Robert Kirk &lt;Robert.... | https://plato.stanford.edu/entries/zombies/ | zombies |
| 256793 | View this site from another server: | https://plato.stanford.edu/entries/zombies/ | zombies |
| 256794 | The Stanford Encyclopedia of Philosophy is cop... | https://plato.stanford.edu/entries/zombies/ | zombies |


The loop below will create a dataset folder. The folder will contain a folder for each topic in the SEP dataset.

In [None]:

# Define the base directory to create subdirectories
base_directory = "SEP/dataset"

# Check if directory exists or create if it does not exist
if not os.path.exists(base_directory):
    os.makedirs(base_directory)

# Iterate through unique categories in the 'df' DataFrame
for category in df.category.unique():
    category_directory = os.path.join(base_directory, category)
    # exist_ok=True lets the cell be re-run without raising FileExistsError
    os.makedirs(category_directory, exist_ok=True)

    dff = df[df['category'] == category]

    for i, sample in enumerate(list(dff.text)):
        with open(os.path.join(category_directory, f'{i}.txt'), 'w', encoding='utf-8') as fp:
            fp.write(sample)

print('Dataset Folder Created!')

Dataset Folder Created!


Using the entire SEP Corpus to train a language model (if you don't have access to powerful GPUs) can take a long time. As a result, for demonstration purposes, we will train our language model on a subset of our corpus.

Our mini-dataset contains only `aesthetics-18th-british`, `aesthetics-18th-french`, `aesthetics-18th-german`, and `aesthetics-19th-romantic`.

In [None]:
import os

filenames = []

directories = ["SEP/dataset/aesthetics-18th-british",
                "SEP/dataset/aesthetics-18th-french",
                "SEP/dataset/aesthetics-18th-german",
                "SEP/dataset/aesthetics-19th-romantic"]

for directory in directories:
    for filename in os.listdir(directory):
        filenames.append(os.path.join(directory, filename))

print(f"Found {len(filenames)} files")


Found 659 files


All of the files found are `txt` files containing text about the topics selected above.

Now, let us shuffle the order of our samples and create a dataset using the `tf.data.TextLineDataset`, which loads text from text files and creates a dataset where each line of the files becomes an element of the dataset. We also select a small `batch_size` to avoid out-of-memory (OOM) problems (in case you are using a GPU).

In [None]:
import random
import tensorflow as tf

batch_size = 16

random.shuffle(filenames)

text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)

Because we are using a small dataset, we will have a small vocabulary (in this example, we end up with a vocabulary with 8600 unique tokens). This is a very small vocabulary compared to famous large language models (Pythia, Llama, Claude, GPT-4, etc.).

> **Note:** Most Large Language Models have tokenizers trained via [Byte-pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding). Check out our other [repositories to learn how](https://github.com/Nkluge-correa/Aira) to create such tokenizers.

We create our vocabulary using the `tf.keras.layers.TextVectorization`, passing a `custom_standardization` function that lowercases strings and separates punctuation marks from words. Then we adapt the `TextVectorization` layer to our dataset and get our vocabulary out of it. We save the vocabulary in a `txt` file for later use. From this vocabulary, we can detokenize the sequences produced by our language model.
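
To make these steps concrete, here is a rough pure-Python sketch of what standardization and vocabulary building amount to. It only mimics the behaviour of `TextVectorization` and is not the Keras implementation: the helpers `standardize` and `build_vocabulary` and the toy corpus are illustrative assumptions. Like the Keras layer in `int` mode, it reserves index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and put a space before every punctuation mark."""
    pattern = f"([{re.escape(string.punctuation)}])"
    return re.sub(pattern, r" \1", text.lower())


def build_vocabulary(lines, max_tokens):
    """Return a token list: padding, OOV, then words by descending frequency."""
    counts = Counter()
    for line in lines:
        counts.update(standardize(line).split())
    # Two slots are reserved for the padding and out-of-vocabulary tokens
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


toy_corpus = ["Beauty is truth, truth beauty.", "Truth is beauty!"]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=10)

print(standardize(toy_corpus[0]))
print(toy_vocab)
```

Because punctuation is split off into separate tokens, marks such as `.` and `,` get their own vocabulary entries, which is why the generated text later in this notebook has spaces before punctuation.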

In [None]:
import string

# Will cut sequences with more than 500 tokens
sequence_length = 500

# Maximum vocabulary size
vocab_size = 8700

# Lower all strings and parse punctuation
def custom_standardization(input_string):
    lowercased = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercased, f"([{string.punctuation}])", r" \1")

from keras.layers import TextVectorization

# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
)

# Fit the TextVectorization layer to the dataset
vectorize_layer.adapt(text_ds)

# Get words back from token indices
vocab = vectorize_layer.get_vocabulary()

print(f'Found {len(vocab)} unique tokens!')

# Save the vocabulary as a text file
with open('vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in vocab:
        fp.write("%s\n" % word)

# Index to detokenize tokens
vocab_index = {}
for index, word in enumerate(vocab):
    vocab_index[word] = index

Found 8600 unique tokens!


To prepare our dataset, we shift word sequences by $1$ position so that the target for the position $i$ is a word at position $i+1$. The model will use all words up to position $i$ to predict the next. Thus, our language model is forced to make predictions in a causal way.

> Note: This is also called causal modeling, i.e., only past tokens can be used to infer the next in the sequence.
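
The shift can be illustrated on a toy sequence. The sketch below uses plain Python lists and words instead of token ids, and the helper name `make_lm_pair` is ours; it prints the context that is visible at each position together with the word to be predicted.

```python
def make_lm_pair(token_ids):
    """Split one token sequence into (inputs, targets) shifted by one position."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets


tokens = ["debates", "about", "artistic", "matters", "were"]
inputs, targets = make_lm_pair(tokens)

# At position i the model may use inputs[0..i] to predict targets[i]
for i, target in enumerate(targets):
    context = inputs[: i + 1]
    print(f"{context} -> {target}")
```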

In [None]:
def prepare_lm_dataset(text):
    """
    Prepares a language modeling dataset by tokenizing the input text.

    Args:
        text (str or tf.Tensor): The input text to be tokenized.

    Returns:
        tuple: A tuple containing two elements: the input sequences (x) and the target sequences (y).
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


text_ds = text_ds.map(prepare_lm_dataset)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)

print('Dataset ready!')

Dataset ready!


In this notebook, we will create a `decoder-only transformer` based on the `GPT` architecture. Our model will use some of the same components we used in our [`sequence-to-sequence`](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Intro-Course/16_sequence_to_sequence.ipynb) notebook, like the `PositionalEmbedding` layer (a way to inject positional information into our model), and the `TransformerDecoder` block.

The combination of these simple blocks gives rise to our `mini-GPT`. If you wish to create a more robust model and give it more data, you could stack more `TransformerDecoder` blocks and create residual connections among them.
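
The `TransformerDecoder` block enforces causality with a lower-triangular attention mask: position $i$ may attend to position $j$ only when $i \geq j$. As a small, framework-free sketch of that idea (the helper `causal_mask` is illustrative and not part of Keras):

```python
def causal_mask(sequence_length):
    """Return a square 0/1 matrix where entry [i][j] is 1 if and only if i >= j."""
    return [
        [1 if i >= j else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]


# Each row i shows which positions j the token at position i can attend to
for row in causal_mask(4):
    print(row)
```

The first row allows attention only to the first token, while the last row allows attention to the whole sequence, so no position can look at tokens that come after it.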

In [None]:
from tensorflow import keras
from keras import layers

class PositionalEmbedding(layers.Layer):
    """
    This class creates a positional embedding layer that adds positional information to the input embeddings.
    It takes in the sequence length, input dimension, and output dimension as arguments.
    The call method takes in the inputs and returns the sum of the token embeddings and positional embeddings.
    The compute_mask method returns a boolean mask tensor based on the inputs.
    The get_config method returns the configuration of the layer.
    """
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    """
    TransformerDecoder is a class that implements the decoder block of the Transformer model.
    It takes in the input sequence, encoder outputs and an optional mask and returns the decoder output.

    Args:
        embed_dim (int): The dimensionality of the embedding space.
        dense_dim (int): The dimensionality of the dense layer.
        num_heads (int): The number of attention heads.

    Returns:
        The decoder output.
    """
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm_2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm_3 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(0.1)
        self.dropout2 = layers.Dropout(0.1)
        self.dropout3 = layers.Dropout(0.1)
        self.supports_masking = True

    def get_config(self):
        config = super(TransformerDecoder, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        # Default to the causal mask so `padding_mask` is defined even without a mask
        padding_mask = causal_mask
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        attention_output_1 = self.attention_1(inputs, inputs, attention_mask=causal_mask)
        attention_output_1 = self.dropout1(attention_output_1)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(query=attention_output_1,
                                                value=encoder_outputs,
                                                key=encoder_outputs,
                                                attention_mask=padding_mask)
        attention_output_2 = self.dropout2(attention_output_2)
        attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        proj_output = self.dropout3(proj_output)
        return self.layernorm_3(attention_output_2 + proj_output)

We end up with a model of roughly 6.6M parameters, which is tiny by the standards of modern state-of-the-art language models. To keep things simple, the model has only two attention heads, one decoder block, an output vocabulary (_the targets of the final dense layer_) of 8600 tokens, an embedding dimension of 256, and a dense latent dimension of 2048 ($Vocab_{8600}, Embedd_{256}, D_{2048}$).
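
The parameter count can be checked by hand. The sketch below redoes the arithmetic under the assumptions that every projection has a bias term and that the attention layers use the Keras default of `value_dim = key_dim`. The embedding and decoder figures can be compared with the `model.summary()` output, and the output-layer figure assumes the final `Dense` layer has `len(vocab) = 8600` units, as in the code.

```python
vocab_size = 8700       # input_dim of the token embedding
output_vocab = 8600     # len(vocab), units of the final Dense layer
sequence_length = 500
embed_dim = 256
dense_dim = 2048
num_heads = 2


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Token embedding table plus position embedding table
embedding = vocab_size * embed_dim + sequence_length * embed_dim

# MultiHeadAttention with key_dim=embed_dim: query, key and value projections
# map embed_dim -> num_heads * key_dim, and the output projection maps back.
proj = num_heads * embed_dim
attention = 3 * dense_params(embed_dim, proj) + dense_params(proj, embed_dim)

feed_forward = dense_params(embed_dim, dense_dim) + dense_params(dense_dim, embed_dim)
layer_norms = 3 * 2 * embed_dim  # gamma and beta for each of the three norms

decoder = 2 * attention + feed_forward + layer_norms
output_layer = dense_params(embed_dim, output_vocab)
total = embedding + decoder + output_layer

print(embedding, decoder, output_layer, total)
```

Roughly a third of the parameters sit in the embedding tables and another third in the output layer, which is typical for a model this small.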

In [None]:
from tensorflow import keras
import keras_nlp

sequence_length = 500
embed_dim = 256
latent_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(sequence_length,), dtype=tf.int32)
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
outputs = layers.Dense(len(vocab), activation="softmax")(x)
model = keras.Model(inputs, outputs=outputs)

perplexity = keras_nlp.metrics.Perplexity(name="perplexity")

# The final Dense layer already applies softmax, so the loss receives probabilities
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                optimizer="adam", metrics=[perplexity])
model.summary()

Using TensorFlow backend
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 500)]        0           []                               
                                                                                                  
 positional_embedding (Position  (None, 500, 256)    2355200     ['input_1[0][0]']                
 alEmbedding)                                                                                     
                                                                                                  
 transformer_decoder (Transform  (None, 500, 256)    2104576     ['positional_embedding[0][0]',   
 erDecoder)                                                       'positional_embedding[0][0]']   
                                                                     

Now, the last thing we need is to train our model. We are using `keras.callbacks` to save the best model in 30 epochs (for such a small dataset, we don't need more). The callback will monitor the `perplexity` of the model, which is the metric we use to evaluate performance.

`Perplexity` is a metric that measures how well a probabilistic model (such as a language model) predicts a sample. It is defined as the exponentiated cross-entropy: when the cross-entropy is measured in bits (logarithm base 2), `perplexity` is 2 raised to the power of the cross-entropy. The cross-entropy, in turn, is the average negative log-probability that the model assigns to the words that actually occur in the sample. If the natural logarithm is used instead, the base becomes $e$ and the resulting `perplexity` is the same.

The formula for `perplexity` is therefore:

$$Perplexity = 2^{Cross\;Entropy}$$

$$Cross\;Entropy = -\frac{1}{n}\sum_{i=1}^{n}\log_2 P(w_i)$$

where

- $n$ is the length of the sample.
- $P(w_i)$ is the predicted probability of the $i$-th word in the sample according to the language model (_sum is taken over all words in the sample_).

`Perplexity` reveals how well a model predicts the next word in a sequence and, thus, how well it knows the language. In general, it is used as an evaluation metric in language modeling tasks.
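
As a small worked example, the sketch below computes perplexity as the exponential of the average negative log-probability, using made-up probabilities for a four-word sample. The function names and the numbers are illustrative assumptions, not outputs of our model.

```python
import math


def cross_entropy(probabilities, base=2.0):
    """Average negative log-probability of the observed words."""
    n = len(probabilities)
    return -sum(math.log(p, base) for p in probabilities) / n


def perplexity(probabilities, base=2.0):
    """Exponentiated cross-entropy, using the same base as the logarithm."""
    return base ** cross_entropy(probabilities, base)


# A model that spreads its belief evenly over 4 options at every step
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that is fairly sure about each observed word
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform))
print(perplexity(confident))
```

The uniform case gives a perplexity of about 4, which matches the intuition that perplexity is the effective number of equally likely choices the model is hesitating between; lower is better.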

In [None]:
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

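# Keep only the checkpoint with the lowest training perplexity seen so far.
# (`patience` and `restore_best_weights` are EarlyStopping arguments, not
# ModelCheckpoint arguments, so they are not passed here.)
callbacks = [keras.callbacks.ModelCheckpoint("text_gen.h5",
                                                save_best_only=True,
                                                monitor="perplexity",
                                                mode="min")]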

model.fit(text_ds, verbose=1, epochs=30, callbacks=callbacks)

Version:  2.11.0
Eager mode:  True
GPU is available
Epoch 1/30


Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fddf21ac130>

Having a trained language model on texts related to aesthetic philosophy, we can now use it to generate some text.

To load models that use custom subclassed layers from the Keras API, you need to pass your classes as custom objects (after defining the classes ...):

```python

model = keras.models.load_model("your_model.keras",
    custom_objects={"TransformerDecoder": TransformerDecoder,
        "PositionalEmbedding": PositionalEmbedding,
        "perplexity": keras_nlp.metrics.Perplexity})

```

In [None]:
from keras.layers import TextVectorization
from tensorflow import keras
import tensorflow as tf
from keras import layers
import numpy as np
import keras_nlp

TextGenerator = keras.models.load_model("text_gen.h5",
    custom_objects={"PositionalEmbedding": PositionalEmbedding,
        "TransformerDecoder": TransformerDecoder,
        "perplexity": keras_nlp.metrics.Perplexity})

We also load our vocabulary to create a detokenization function for the outputs of our model. With this vocabulary, we create a TextVectorization layer to tokenize our prompt inputs, passing our vocabulary so we don't have to adapt it again.
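
The round trip between words and token ids can be sketched with a tiny, made-up vocabulary. The names `toy_vocab`, `tokenize`, and `detokenize_all` are illustrative; as in our real vocabulary file, index 0 is the padding token and index 1 is the out-of-vocabulary token.

```python
# Hypothetical vocabulary: index 0 is padding, index 1 is out-of-vocabulary
toy_vocab = ["", "[UNK]", "the", "of", "debates", "about", "artistic", "matters"]
toy_index = {word: idx for idx, word in enumerate(toy_vocab)}


def tokenize(prompt, index, oov_id=1):
    """Map each whitespace-separated word to its id, or to the OOV id."""
    return [index.get(word, oov_id) for word in prompt.lower().split()]


def detokenize_all(token_ids, vocabulary):
    """Map ids back to words and join them with spaces."""
    return " ".join(vocabulary[i] for i in token_ids)


ids = tokenize("Debates about sublime matters", toy_index)
print(ids)
print(detokenize_all(ids, toy_vocab))
```

Words that are missing from the vocabulary (here, "sublime") are mapped to the out-of-vocabulary id, so they cannot be recovered when detokenizing.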

In [None]:
with open('vocabulary.txt', encoding='utf-8') as fp:
    vocab = [line.strip() for line in fp]

vocab_index = {}
for index, word in enumerate(vocab):
    vocab_index[word] = index

text_vectorization = TextVectorization(max_tokens=len(vocab),
                                        output_mode="int",
                                        output_sequence_length=500,
                                        vocabulary=vocab)

With all of these parts ready, we can create functions to:

1. Sample from our model (thus producing less deterministic outputs).
2. Detokenize our samples.
3. Generate some text! 🎉
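
The sampling step can be sketched without any deep learning framework. The example below is a minimal version of top-k sampling over a made-up probability vector: it keeps the `k` most probable token ids and draws one of them in proportion to its probability. The name `top_k_sample` and the numbers are assumptions for illustration, not the notebook's own function.

```python
import random


def top_k_sample(probabilities, k, rng=random):
    """Sample an index from the k most probable entries, renormalized."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    top = ranked[:k]
    weights = [probabilities[i] for i in top]
    # random.choices normalizes the weights internally
    return rng.choices(top, weights=weights, k=1)[0]


probs = [0.05, 0.6, 0.25, 0.1]
rng = random.Random(0)

print(top_k_sample(probs, k=1, rng=rng))  # always the argmax
print([top_k_sample(probs, k=2, rng=rng) for _ in range(5)])
```

With `k=1` the choice is deterministic (greedy decoding); larger values of `k` trade some coherence for more varied output.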

In [None]:

def sample_from(probs, chose_from):
    """
    Sample a token id from the top n (chose_from = n) entries
    of the model's output distribution. The model ends in a
    softmax layer, so its outputs are already probabilities:
    we keep the top n and renormalize them instead of applying
    softmax a second time. With "chose_from = 1", this function
    always returns the argmax.
    """

    probs, indices = tf.math.top_k(probs, k=chose_from, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = np.asarray(probs).astype("float64")
    preds = preds / preds.sum()
    return np.random.choice(indices, p=preds)


def detokenize(number):
    """
    Function to turn tokens back into words ...
    """
    return vocab[number]

def generate_text(start_tokens, generate_tokens, vocab, chose_from):
    """
    This function takes as input a sequence of "start_tokens"
    (a prompt), a number of tokens to be generated, a vocabulary,
    and the "chose_from" parameter. The function will output a
    sequence of words, generated by the model, using the sampling
    method established. You can change the number of the "chose_from"
    argument to make the output less repetitive and more random.
    """

    start_tokens = [_ for _ in start_tokens]
    num_tokens_generated = 0
    tokens_generated = []

    while num_tokens_generated <= generate_tokens:

        pad_len = sequence_length - len(start_tokens)
        sample_index = len(start_tokens) - 1
        if pad_len < 0:
            x = start_tokens[:sequence_length]
            sample_index = sequence_length - 1
        elif pad_len > 0:
            x = start_tokens + [0] * pad_len
        else:
            x = start_tokens

        x = np.array([x])
        y = TextGenerator.predict(x, verbose=0)
        sample_token = sample_from(y[0][sample_index], chose_from)
        tokens_generated.append(sample_token)
        start_tokens.append(sample_token)
        num_tokens_generated = len(tokens_generated)

    # start_tokens already holds the prompt followed by every generated token
    return " ".join([detokenize(_) for _ in start_tokens])

start_prompt = "debates about artistic matters"
start_tokens = [vocab_index.get(_, 1) for _ in start_prompt.split()]
generate_tokens = 50

generate_text(start_tokens, generate_tokens, vocab, 1)

'debates about artistic matters were greatly influenced by the new spaces and means of communication that emerged in the seventeenth and eighteenth centuries . critics expressed their judgments in published treatises and in periodicals such as le mercure galant ; philosophical ideas were also developed in oral conversations between members of the newly founded royal'

As expected, the model knows "_something about 17th-century aesthetics_." However, if you increase the sampling parameter (`chose_from`), the output quickly becomes less coherent and can degrade into gibberish. This is to be expected, given that we're dealing with a small model trained on very little text.

In the end, good language models require a lot of training and data. If you would like to train more capable models, you need to invest in computing, given that training rounds can easily last days, weeks, and even months. In this [repository](https://github.com/Nkluge-correa/Aira), we have the code for training models like BERT and GPT-2 from scratch using the `transformers` library.

You can find two already-trained models (`bert-base-wikitext`, `bert-base-bookcorpus`), each trained on a distinct dataset.

> Note: These models were trained on an RTX 3070 for approximately 15 days with a batch size of 8.

Let us briefly test one of these models.

In [6]:
from transformers import pipeline

model = 'AiresPucrs/bert-base-wikitext'
pipe = pipeline('fill-mask', model=model, tokenizer=model)

def unmask(string):
    outputs = pipe(string)
    for i, result in enumerate(outputs):
        print(f"Result {i+1}:")
        print(f"Score: {result['score']}")
        print(f"Token: {result['token']}")
        print(f"Token String: {result['token_str']}")
        print(f"Sequence: {result['sequence']}")
        print("\n")

unmask("Paris is the [MASK] of France.")

Result 1:
Score: 0.9921931624412537
Token: 3007
Token String: capital
Sequence: paris is the capital of france.


Result 2:
Score: 0.0008999903220683336
Token: 2803
Token String: centre
Sequence: paris is the centre of france.


Result 3:
Score: 0.0006859182612970471
Token: 2540
Token String: heart
Sequence: paris is the heart of france.


Result 4:
Score: 0.0004766975180245936
Token: 2415
Token String: center
Sequence: paris is the center of france.


Result 5:
Score: 0.0004332299577072263
Token: 2148
Token String: south
Sequence: paris is the south of france.




To summarize, developing a `language model` is a complex task that requires knowledge of `natural language processing` and machine learning, as well as resources such as data and computing power. Good language models need large amounts of both, so don't be discouraged if your first attempts don't yield cutting-edge results.

---

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).