## Train a character-level GPT on some text data

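The inputs here are simple text files, which we chop up into individual characters and then train GPT on. So you could say this is a char-transformer instead of a char-rnn, although that doesn't roll off the tongue quite as well. In this example we feed it some Shakespeare and get it to predict the text one character at a time. Note that the cells below actually work on subword tokens from the pretrained `DeepPavlov/rubert-base-cased` tokenizer and on Russian text, rather than on raw characters.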

In [1]:
from allennlp.data.token_indexers import TokenIndexer, PretrainedTransformerIndexer
from allennlp.data.tokenizers import Token, Tokenizer, PretrainedTransformerTokenizer

import nltk
import numpy as np
from os import listdir
from os.path import join as pathjoin
import pandas as pd
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from mingpt.model import GPT, GPTConfig
from mingpt.trainer import Trainer, TrainerConfig
# make deterministic
from mingpt.utils import sample, set_seed
set_seed(128)
np.random.seed(128)

In [2]:
DATA_DIR = '/home/mlepekhin/data'
MODELS_DIR = '/home/mlepekhin/models'
transformer_model = 'DeepPavlov/rubert-base-cased'

In [3]:
from allennlp.data import Vocabulary


tokenizer = PretrainedTransformerTokenizer(transformer_model)
indexer = PretrainedTransformerIndexer(transformer_model)
# from_files is a classmethod, so no Vocabulary instance is needed
bert_vocab = Vocabulary.from_files(
    pathjoin(MODELS_DIR, 'allennlp_rubert_from_discriminator', 'vocab')
)
indexer.tokens_to_indices(tokenizer.tokenize('присоединились'), bert_vocab)

{'token_ids': [101, 29895, 102],
 'mask': [True, True, True],
 'type_ids': [0, 0, 0]}

In [4]:
bert_token_to_index = bert_vocab.get_token_to_index_vocabulary('tags')
bert_index_to_token = bert_vocab.get_index_to_token_vocabulary('tags')
bert_token_to_index['присоединились']

29895
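
The two lookup tables above map tokens to ids and back. Below is a minimal sketch of that round trip on a toy vocabulary. The names `toy_token_to_index`, `encode` and `decode` are hypothetical and exist only for this illustration; the real tables hold the full RuBERT vocabulary. The `' ##'` replacement mirrors what `detokenize` and the generation code in this notebook do to merge continuation pieces.

```python
# Toy stand-in for bert_token_to_index / bert_index_to_token (hypothetical data).
toy_token_to_index = {'[CLS]': 0, '[SEP]': 1, 'play': 2, '##ing': 3, 'music': 4}
toy_index_to_token = {i: t for t, i in toy_token_to_index.items()}


def encode(tokens):
    """Map token strings to integer ids."""
    return [toy_token_to_index[t] for t in tokens]


def decode(ids):
    """Map ids back to text: drop special tokens, glue '##' pieces to the previous token."""
    tokens = [toy_index_to_token[i] for i in ids]
    tokens = [t for t in tokens if t not in ('[CLS]', '[SEP]')]
    return ' '.join(tokens).replace(' ##', '')


ids = encode(['[CLS]', 'play', '##ing', 'music', '[SEP]'])
print(ids)          # [0, 2, 3, 4, 1]
print(decode(ids))  # playing music
```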

In [5]:
import math
from torch.utils.data import Dataset

def detokenize(tokens):
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')

class BPEDataset(Dataset):
    def __init__(self, data, block_size):
        data_size, vocab_size = len(data), len(bert_token_to_index)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) tokens from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every token to its integer id in the BERT vocabulary
        dix = [bert_token_to_index[word] for word in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
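
The input/target shift described in the docstring can be checked on a tiny sequence. This is a sketch using plain Python lists instead of `torch` tensors; `make_example` is a hypothetical helper that follows the slicing in `__getitem__`, and the integer ids are made up.

```python
def make_example(token_ids, idx, block_size):
    """Return (x, y) where y is x shifted one position to the left."""
    chunk = token_ids[idx:idx + block_size + 1]
    return chunk[:-1], chunk[1:]


token_ids = [10, 11, 12, 13, 14, 15]  # made-up ids standing in for encoded tokens
x, y = make_example(token_ids, idx=0, block_size=4)
print(x)  # [10, 11, 12, 13]
print(y)  # [11, 12, 13, 14]

# Each prefix of x is one training example whose target is the matching entry of y.
for t in range(len(x)):
    print(x[:t + 1], '->', y[t])

# As in __len__, there are len(data) - block_size valid start positions.
num_examples = len(token_ids) - 4
print(num_examples)  # 2
```

With `block_size=4`, a single chunk therefore yields four next-token predictions from one forward pass.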


In [6]:
block_size = 128
tokenizer = PretrainedTransformerTokenizer(transformer_model)

In [7]:
topic_sentences_df = pd.read_csv('/home/mlepekhin/data/ru_topic_big_sentences.csv')
print(topic_sentences_df.shape)
topic_sentences_df.head()

(1000, 3)


Unnamed: 0.1,Unnamed: 0,topic,sentence
0,0,music,на тумбочке слева от шкафа рассматриваем альбо...
1,1,music,любое другое использование песен без дополните...
2,2,music,"как правило , изобразительному искусству и муз..."
3,3,music,детям частенько говорят : не занимайся музыкой...
4,4,music,"обновлен раздел "" фотографии "" - добавлено нес..."


In [8]:
topic_sentences_df.values[200:300]

array([[200, 'politics',
        'в этом году « дортрансэкспо » посетил министр транспорта и дорожного хозяйства республики татарстан фасхутдинов ильдус ирфанович , и дал'],
       [201, 'politics',
        'министр внутренних дел великобритании удоволетворил ходатайство об экстрадиции дудко'],
       [202, 'politics',
        '« с одной стороны , принципами демократии продиктовано то , что у нас в том или ином муниципальном образовании происходят'],
       [203, 'politics',
        'ситуацию надо менять кардинально » , — категорично высказался по этому поводу министр культуры области алексей бетехтин инвестиции в организацию'],
       [204, 'politics',
        'федеральным законом от 18 июля 2011 года № 242 -фз « о внесении изменений в отдельные законодательные акты российской федерации'],
       [205, 'politics',
        'вдвое сокращается срок госрегистрации заложенного имущества госдума вчера приняла во втором чтении законопроект " о внесении изменений в отдельные законодательные']

In [9]:
topic_dict = dict()

for topic, sentence in zip(topic_sentences_df.topic.values, topic_sentences_df.sentence.values):
    if topic not in topic_dict:
        topic_dict[topic] = []
    topic_dict[topic].append(sentence)
print(topic_dict.keys())

dict_keys(['music', 'education', 'politics', 'sport', 'business', 'literature', 'crime', 'travel', 'games', 'arhitecture'])
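
The same grouping can be written a little more compactly with `collections.defaultdict`, which removes the explicit membership check. A sketch on made-up `(topic, sentence)` pairs; `rows` and `grouped` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Made-up (topic, sentence) pairs standing in for the DataFrame columns.
rows = [
    ('music', 'first sentence'),
    ('sport', 'second sentence'),
    ('music', 'third sentence'),
]

grouped = defaultdict(list)  # a missing key starts out as an empty list
for topic, sentence in rows:
    grouped[topic].append(sentence)

print(dict(grouped))
# {'music': ['first sentence', 'third sentence'], 'sport': ['second sentence']}
```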


In [10]:
from mingpt.utils import sample

def generate_topic_dataset(train_text_file, state_dict_file, n_layer=4, n_head=4, n_embd=256,
                     texts_count=100, text_len=500):
    # read the whole training file and close it again before tokenizing
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    tokens = np.concatenate([tokenizer.tokenize(sent) for sent in text_sentences])
    tokens = [str(token) for token in tokens]
    train_dataset = BPEDataset(tokens, block_size) 
    print("dataset is loaded")
    
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    model.load_state_dict(torch.load(state_dict_file))
    print("model is loaded")
    
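    # no training is run in this function; the Trainer is created for its
    # `device` attribute, which is used below to place the prompt tensor
    tconf = TrainerConfig(num_workers=1)
    trainer = Trainer(model, train_dataset, None, tconf)

    for topic, topic_sentences in topic_dict.items():
        for text_id in range(texts_count):
            # use a random sentence of this topic as the generation prompt
            context = tokenizer.tokenize(np.random.choice(topic_sentences))
            x = torch.tensor(
                [bert_token_to_index[str(s)] for s in context], dtype=torch.long
            )[None, ...].to(trainer.device)
            # generate text_len further tokens with temperature 1.0 and top_k=10
            y = sample(model, x, text_len, temperature=1.0, sample=True, top_k=10)[0]
            # map ids back to tokens, merge '##' continuation pieces, drop special tokens
            completion = ' '.join([bert_index_to_token[int(i)] for i in y]).replace(' ##', '')
            completion = completion.replace('[CLS]', '').replace('[SEP]', '')
            yield completion, topic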
        

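def test_keywords(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512):
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # [1:-1] drops the first and last token of every tokenized sentence
    tokens = np.concatenate([tokenizer.tokenize(sent)[1:-1] for sent in text_sentences])
    tokens = [str(token) for token in tokens]
    train_dataset = BPEDataset(tokens, block_size)
    print("dataset is loaded")
    # BPEDataset has no `stoi` attribute, so build the set of seen tokens directly
    tokens_set = set(tokens)
    # NOTE: `topics` (topic -> list of keywords) is not defined in this notebook
    # and has to be provided before this function is called
    for topic, topic_keywords in topics.items():
        print(len(set(topic_keywords) & tokens_set))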
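
The call to `sample` above uses `temperature=1.0` and `top_k=10`. As a rough illustration of what top-k sampling with a temperature means, here is a standard-library sketch. It is not minGPT's implementation, which operates on batched `torch` tensors; `top_k_sample` and the example logits are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample an index from the k highest logits after temperature scaling."""
    scaled = [value / temperature for value in logits]
    # indices of the k largest scaled logits
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # softmax weights over the kept indices (subtract the max for numerical stability)
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


random.seed(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
samples = [top_k_sample(logits, k=2) for _ in range(1000)]
print(set(samples))  # only indices 0 and 3 can appear
```

A lower temperature sharpens the distribution over the kept candidates, and `k=1` reduces to always picking the most likely token.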

In [11]:
GENRE_DATA_DIR = '/home/mlepekhin/data/genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_bpe_tuned/'
LANG = 'ru'

In [12]:
#for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
#    label = train_text_file[:-4]
#    print(label)
#    test_keywords(
#        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
#        pathjoin(GPT_MODELS_DIR, LANG, label)
#    )

In [13]:
result_df = pd.DataFrame()

In [None]:
rows = []
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    label = train_text_file[:-4]  # drop the last 4 characters (the file extension)
    if label.startswith('A'):
        for text, topic in generate_topic_dataset(
            pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
            pathjoin(GPT_MODELS_DIR, LANG, label)
        ):
            # DataFrame.append was removed in pandas 2.0, so collect rows in a list
            rows.append({'target': label, 'text': text, 'topic': topic})
result_df = pd.DataFrame(rows)

  0%|          | 0/11 [00:00<?, ?it/s]

data has 199175 characters, 119547 unique.
dataset is loaded
model is loaded


  9%|▉         | 1/11 [33:21<5:33:32, 2001.21s/it]

data has 142847 characters, 119547 unique.
dataset is loaded
model is loaded


 18%|█▊        | 2/11 [1:06:20<4:59:12, 1994.76s/it]

data has 603817 characters, 119547 unique.
dataset is loaded
model is loaded


 27%|██▋       | 3/11 [1:39:29<4:25:44, 1993.04s/it]

data has 96004 characters, 119547 unique.
dataset is loaded
model is loaded


 36%|███▋      | 4/11 [2:12:25<3:51:54, 1987.72s/it]

data has 287136 characters, 119547 unique.
dataset is loaded
model is loaded


In [17]:
result_df.tail()

Unnamed: 0,target,text,topic
9995,A16,в конце xvii века западные государства постро...,arhitecture
9996,A16,"шел через тучков мост , увидел проплывающую ""...",arhitecture
9997,A16,также в стенах города находится и величествен...,arhitecture
9998,A16,в случае нарушения сроков предоставления в со...,arhitecture
9999,A16,варианты нестандартных табличек : в соответст...,arhitecture


In [18]:
result_df.to_csv('/home/mlepekhin/data/min_gpt_bpe/ru_train_topic_big_prefixes.csv')