<a href="https://colab.research.google.com/github/D-Sokol/denotarikon/blob/main/Sandbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers==4.1.1



In [2]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

In [4]:
start_text = "The best possible example for denotarikon text generation should starts with"
start_tokens = tokenizer.encode(start_text)
start_tokens

[383,
 1266,
 1744,
 1672,
 329,
 2853,
 313,
 283,
 1134,
 261,
 2420,
 5270,
 815,
 4940,
 351]

In [5]:
# Cumulative probability threshold for nucleus (top-p) sampling.
p_threshold = 0.95
# Desired total number of tokens in the result, including the prompt tokens.
n_tokens = 111
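
To make the role of `p_threshold` concrete, here is a minimal, dependency-free sketch of the candidate selection that nucleus sampling performs. The helper name `nucleus_candidates` and the toy distribution are illustrative assumptions, not part of the notebook; the actual cell below does the same thing with `torch.sort` and `torch.cumsum`.

```python
import random


def nucleus_candidates(probs, p_threshold):
    """Return (indices, renormalised probabilities) for the smallest set of
    most likely items whose total probability exceeds p_threshold.

    Hypothetical helper for illustration only.
    """
    # Indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass > p_threshold:
            break
    # Rescale the kept probabilities so they sum to 1 again.
    return kept, [probs[i] / mass for i in kept]


# Toy distribution over five "tokens" (made-up numbers).
toy_probs = [0.05, 0.5, 0.02, 0.3, 0.13]
kept, renorm = nucleus_candidates(toy_probs, p_threshold=0.9)
print(kept)  # indices of the tokens that stay in the nucleus

# Sample the next token from the truncated, renormalised distribution.
next_index = random.choices(kept, weights=renorm, k=1)[0]
```

A lower threshold keeps fewer candidates and makes the output more conservative; a threshold close to 1 approaches sampling from the full distribution.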

In [6]:
tokens = start_tokens[:]
with torch.no_grad():
    # Run the whole prompt once and keep the cached keys/values for incremental decoding.
    result = model(torch.tensor(tokens, device=device)[None], past_key_values=None)
    next_logits, past = result['logits'][0, -1, :], result['past_key_values']
    for i in range(len(tokens), n_tokens):
        next_probas = torch.softmax(next_logits, dim=-1).cpu()

        sorted_p, sorted_ix = torch.sort(next_probas, descending=True)
        cumulative_p = torch.cumsum(sorted_p, dim=-1)

        # Number of possible choices for next token, calculated as minimal n
        #  such that sum of probabilities of the first n tokens exceeds p_threshold
        n_tokens_next = np.argmax(cumulative_p.numpy() > p_threshold) + 1

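        # Keep only the nucleus and renormalise it so the probabilities sum to 1.
        sorted_p = sorted_p[:n_tokens_next]
        sorted_p /= cumulative_p[n_tokens_next-1]
        # Sample a position within the nucleus, then map it back to a vocabulary index.
        ix_ix = np.random.choice(n_tokens_next, p=sorted_p.numpy())
        next_ix = sorted_ix[ix_ix]
        tokens.append(next_ix.item())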

        # TODO: split by 1st letter
        # Feed only the new token; it was sampled on the CPU, so move it to the model's device.
        result = model(next_ix[None].to(device), past_key_values=past)
        next_logits, past = result['logits'][0, :], result['past_key_values']
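
The TODO in the loop above is not spelled out in the notebook. One possible reading, given the project name, is that candidate tokens should be grouped or filtered by the first letter of the word they start. The sketch below shows that idea on a toy candidate list. It assumes token strings in GPT-2's byte-level BPE form, where a token that begins a new word carries a leading `Ġ` (such strings can be obtained with `tokenizer.convert_ids_to_tokens`). The helpers `first_letter` and `filter_by_letter` and the toy probabilities are hypothetical.

```python
def first_letter(token):
    """Return the lowercase first letter of a word-initial token, else None.

    Assumes GPT-2 style token strings, where 'Ġ' marks a preceding space.
    """
    if token.startswith("Ġ") and len(token) > 1 and token[1].isalpha():
        return token[1].lower()
    return None


def filter_by_letter(candidates, letter):
    """Keep (token, prob) pairs whose word starts with `letter`, renormalised.

    Returns an empty list when no candidate matches.
    """
    kept = [(t, p) for t, p in candidates if first_letter(t) == letter.lower()]
    total = sum(p for _, p in kept)
    return [(t, p / total) for t, p in kept] if total > 0 else []


# Toy candidates (made-up probabilities); "ing" continues a word, so it is dropped.
toy = [("Ġthe", 0.4), ("Ġtext", 0.2), ("Ġa", 0.3), ("ing", 0.1)]
filtered = filter_by_letter(toy, "t")
print(filtered)
```

This sketch discards every continuation token, which is a simplification: a fuller version would probably need to let a word be completed by tokens without the `Ġ` marker before constraining the next word's initial.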

In [7]:
print(tokenizer.decode(tokens))

 The best possible example for denotarikon text generation should starts with the text lines with a combination of height and horizontal position.

Multiple lines could be created with the datatype: DT_UNLIMITED. (If you want to use corebeyondXXX mode, you'll have to set the ddt_limit=15 to VYS_DETECTORIAL.) This should also take care of transitions.

Note that if you wanna use four separate datatypes or more, one is the best option.
