# Character-level generative model
---

### Load Dataset

In [97]:
with open('./data/mini_gpt.txt', 'r') as f:
    X_train_data = f.read()

len(X_train_data) , type(X_train_data)
X_train_data[:1000]

(527204238, str)

'Valkyria Chronicles III Senj no Valkyria 3 : Unrecorded Chronicles ( Japanese : 3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer 

In [98]:
with open('./data/valid_mini_gpt.txt', 'r') as f:
    X_valid_data = f.read()

len(X_valid_data) , type(X_valid_data)
X_valid_data[:1000]

(1025171, str)

'Homarus gammarus Homarus gammarus , known as the European lobster or common lobster , is a species of clawed lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into planktonic larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . Description Homarus gammarus is a large crustacean , with a body length up to 60 centimetres ( 24 in ) and weighing up to 5 6 kilograms ( 11 13 lb ) , although the lobsters caught in lobster pots are usually 23 38 cm ( 9 15 in ) long and weigh 0 7 2 2 kg ( 1 5 4 9 lb ) . Like other

In [99]:
with open('./data/test_mini_gpt.txt', 'r') as f:
    X_test_data = f.read()

len(X_test_data) , type(X_test_data)
X_test_data[:1000]

(1258729, str)

'Robert Boulter Robert Boulter is an English film , television and theatre actor . He had a guest starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy \'s Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of t

## Encoding - Decoding

### Basic character-to-index mapping

In [100]:
s = ''.join(set(X_train_data))
coding_str = sorted(s)
''.join(coding_str) , len(coding_str)

(' !"#$%&\'()*+,-./0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~',
 94)

In [101]:
s = ''.join(set(X_test_data))
''.join(sorted(s)) , len(s)
s = ''.join(set(X_valid_data))
''.join(sorted(s)) , len(s)

(' !"#$%&\'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^abcdefghijklmnopqrstuvwxyz',
 86)

(' !"$%&\'()*+,-./0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz~',
 86)
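Since the character codes are derived from the training text only, any character that occurs only in the validation or test text would have no code. The printed character sets above appear to be contained in the training set, but an explicit check is cheap. A minimal sketch, using short made-up strings in place of the real files (the names `train_text`, `valid_text` and `missing` are illustrative and not part of the notebook):

```python
# Toy stand-ins for the real corpora (assumption: the real check would use
# X_train_data and X_valid_data / X_test_data instead).
train_text = "hello world"
valid_text = "hold the door"

# Vocabulary built from the training text only, as in the notebook.
vocab = sorted(set(train_text))

# Characters seen in the held-out text that have no code in the vocabulary.
missing = sorted(set(valid_text) - set(vocab))

print("vocabulary:", "".join(vocab))
print("characters without a code:", missing)
```

An empty `missing` list means every held-out character can be encoded.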

In [102]:
encoder = {}
decoder = {}
for i, char in enumerate(coding_str): 
    encoder[char] = i
    decoder[i] = char


encoder['M']
decoder[encoder['M']]

44

'M'

In [103]:
encode = lambda x: [encoder[char] for char in x]
encode('Mahanth')
decode = lambda x: ''.join([decoder[i] for i in x])
decode(encode('Mahanth'))

[44, 64, 71, 64, 77, 83, 71]

'Mahanth'

## Subword encoding with tiktoken

In [104]:
# !pip install -q tiktoken

In [105]:
import tiktoken

tikt = tiktoken.get_encoding('gpt2')
tikt.n_vocab

50257

In [106]:
tikt.encode('Mahanth')
tikt.decode(tikt.encode('Mahanth'))

[44, 19210, 400]

'Mahanth'

## Store the encoding in a `torch.Tensor` using PyTorch

In [107]:
import torch 

In [108]:
X_train = torch.tensor(encode(X_train_data) , dtype = torch.long)

In [109]:
X_test = torch.tensor(encode(X_test_data) , dtype = torch.long)
X_valid= torch.tensor(encode(X_valid_data) , dtype = torch.long)

X_train.shape , X_test.shape , X_valid.shape

(torch.Size([527204238]), torch.Size([1258729]), torch.Size([1025171]))

In [110]:
X_train.shape, X_train.dtype , X_train[:10]

(torch.Size([527204238]),
 torch.int64,
 tensor([53, 64, 75, 74, 88, 81, 72, 64,  0, 34]))

In [111]:
block_size = 8
X_train[:block_size + 1]
X_train_data[:block_size + 1]

tensor([53, 64, 75, 74, 88, 81, 72, 64,  0])

'Valkyria '
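A block of `block_size + 1` characters yields `block_size` training examples: each prefix of the block is a context and the character that follows it is the target. A small sketch in plain Python, using the nine codes printed above (the names `block`, `x`, `y` and `pairs` are illustrative):

```python
# The first block_size + 1 codes of the training text, copied from the output above.
block = [53, 64, 75, 74, 88, 81, 72, 64, 0]
block_size = 8

x = block[:block_size]        # inputs
y = block[1:block_size + 1]   # targets: the inputs shifted left by one position

# One (context, target) pair per position in the block.
pairs = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

A bigram model only looks at the last element of each context, but the same input and target layout carries over to models that use the whole context.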

In [112]:
batch_size = 4

In [113]:
def batch_split(on='train', batch_size=batch_size, block_size=block_size,
                X_train=X_train, X_test=X_test, X_valid=X_valid):
    # Note: the defaults above are bound once, when this function is defined.
    if on == 'train':
        X = X_train
    elif on == 'test':
        X = X_test
    else:
        X = X_valid
    # Random start offsets; each must leave room for block_size + 1 characters.
    ix = torch.randint(len(X) - block_size, (batch_size,))
    x = [X[i : i + block_size] for i in ix]          # inputs
    y = [X[i + 1 : i + block_size + 1] for i in ix]  # targets: inputs shifted by one
    return torch.stack(x), torch.stack(y)

x,y = batch_split()
x.shape , y.shape 
x,y

(torch.Size([4, 8]), torch.Size([4, 8]))

(tensor([[75, 64, 83, 72, 78, 77, 82, 71],
         [82, 66, 78, 81, 83, 68, 67,  0],
         [72, 79, 64, 83, 68, 67,  0, 82],
         [81, 68, 64, 82, 68,  0, 72, 77]]),
 tensor([[64, 83, 72, 78, 77, 82, 71, 72],
         [66, 78, 81, 83, 68, 67,  0, 65],
         [79, 64, 83, 68, 67,  0, 82, 68],
         [68, 64, 82, 68,  0, 72, 77,  0]]))
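One detail of the signature above is worth keeping in mind: Python evaluates default argument values once, when the `def` statement runs. Reassigning the global `batch_size` later therefore does not change what `batch_split()` uses unless the value is passed explicitly. A small sketch of the behaviour in plain Python (the function names are hypothetical):

```python
batch_size = 4

def bound_at_definition(batch_size=batch_size):
    # The default is the value batch_size had when this def ran (4).
    return batch_size

def resolved_at_call(n=None):
    # The global is looked up each time the function is called.
    return batch_size if n is None else n

batch_size = 64  # later reassignment, as a notebook cell might do

early = bound_at_definition()
late = resolved_at_call()
print(early, late)
```

Passing the value at the call site, for example `batch_split(batch_size=64)`, avoids the surprise.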

## Bigram language model

In [114]:
import torch.nn as nn 
import torch.nn.functional as F

class BiGramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size , vocab_size)

    def forward(self, x, y = None):
        logits = self.token_embedding_table(x) # (batch_size, block_size, vocab_size)
        if y is None:
            return logits
        
        # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,); here N = batch_size * block_size, C = vocab_size
        loss = F.cross_entropy(logits.view(-1, self.vocab_size), y.view(-1))
        return logits, loss
    
    def generate(self, x, n_pred):
        for _ in range(n_pred):
            logits = self(x)[:,-1,:]
            prob_dist = F.softmax(logits, -1)
            x = torch.cat([x, torch.multinomial(prob_dist, 1)], -1)
            
            # logits = self.token_embedding_table(x)
            # x = torch.cat([x, torch.argmax(logits, -1)[:,-1].unsqueeze(-1)], -1)
        return x

m = BiGramLanguageModel(len(coding_str))
logits,loss = m(x,y)
logits.shape, loss

(torch.Size([4, 8, 94]), tensor(5.2589, grad_fn=<NllLossBackward0>))
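As a sanity check on the initial loss: a model that assigns equal probability to each of the 94 characters would have a cross-entropy of ln 94. The untrained value above is somewhat higher, which is plausible because `nn.Embedding` initialises its weights from a standard normal distribution, so the initial logits are random rather than uniform. A quick calculation (the variable names are illustrative):

```python
import math

vocab_size = 94
observed_loss = 5.2589  # untrained loss reported above

# Cross-entropy of a uniform prediction over vocab_size classes.
uniform_loss = math.log(vocab_size)

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"observed - baseline: {observed_loss - uniform_loss:.4f}")
```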

Random gibberish text generation from the untrained model

In [115]:
x1 = torch.zeros((1,1), dtype = torch.long)
decode(m.generate( x1 , n_pred = 100)[0].tolist())

' J\\u#fJ};9J<mU/\\IGv~r9u[4HU0`hK6d&Q\\F@2Yw<;v|OKCAp>$KLOMuf$c6k>Cko/vv@:seyIMGcIBTOitDQTO(T"H%+iy]r~s-'
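The `generate` method draws the next character from the softmax distribution with `torch.multinomial`, while the commented-out lines in the class take the `argmax` instead. The difference can be sketched in plain Python with a made-up three-way logit vector (all names and values here are illustrative, not taken from the model):

```python
import math
import random

random.seed(0)

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three characters
probs = softmax(logits)

# Greedy decoding always picks the most likely index.
greedy = max(range(len(probs)), key=probs.__getitem__)

# Sampling picks each index in proportion to its probability.
samples = random.choices(range(len(probs)), weights=probs, k=1000)
counts = [samples.count(i) for i in range(len(probs))]

print("probabilities:", [round(p, 3) for p in probs])
print("greedy choice:", greedy)
print("sample counts:", counts)
```

Greedy decoding is deterministic and tends to repeat itself, whereas sampling produces varied text at the cost of occasionally choosing unlikely characters.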

#### Training the model

Using the Adam optimiser.

In [116]:
optimiser = torch.optim.Adam(m.parameters(), lr = 1e-2)

In [122]:
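batch_size = 64
for i in range(10000):
    # Pass batch_size explicitly: batch_split bound its default (4) when it was defined.
    x,y = batch_split(batch_size = batch_size)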
    optimiser.zero_grad(set_to_none = True)
    _, loss = m(x,y)
    loss.backward()
    optimiser.step()
    if i % 1000 == 0:
        print(f'Loss : {loss.item()}')
        # print(decode(m.generate(x[:1], n_pred = 100)[0].tolist()))
print(f'Loss : {loss.item()}')

Loss : 2.4655206203460693
Loss : 2.448436975479126
Loss : 2.3681063652038574
Loss : 2.506385564804077
Loss : 2.8326637744903564
Loss : 2.7357940673828125
Loss : 2.620405912399292
Loss : 2.4185690879821777
Loss : 2.194861888885498
Loss : 3.183865785598755
Loss : 2.260883092880249
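The loss of a single batch is noisy, as the printout above suggests. A common way to get a steadier number is to average the loss over many batches with gradients disabled, which is also one way the validation split could be used. A minimal, self-contained sketch under stated assumptions: `table` stands in for the notebook's model and `data` for `X_valid`, and `get_batch`, `estimate_loss` and `n_batches` are hypothetical names that do not appear in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 94, 8, 64

# Stand-in model: one embedding table whose rows are next-character logits.
table = nn.Embedding(vocab_size, vocab_size)

# Stand-in data: random token ids (the notebook would use the encoded validation text).
data = torch.randint(vocab_size, (10_000,))

def get_batch(data):
    # Same sampling scheme as batch_split: random offsets, targets shifted by one.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(n_batches=50):
    table.eval()
    losses = torch.zeros(n_batches)
    for k in range(n_batches):
        x, y = get_batch(data)
        logits = table(x)
        losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    table.train()
    return losses.mean().item()

val_loss = estimate_loss()
print(f"mean loss over 50 batches: {val_loss:.4f}")
```

Calling such a function every 1000 steps in the training loop, once on training batches and once on validation batches, would give two comparable curves instead of a single noisy value.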


In [123]:
x1 = torch.zeros((1,1), dtype = torch.long)
decode(m.generate( x1 , n_pred = 250)[0].tolist())

' And thid tore blthesintof t . rting thathertrir aptathenuce , . ravee a h to Fil duatth aticalearn oto acerngaves 246 c , s Theraton nd caplyommasthoro . tismenofrwatrasthiorcegh preldy cha . Hea ted lpongte e , dinck sers rirred ofat ivinitiolls pss'

In [124]:
import datetime

start_time = datetime.datetime.now()
x1 = torch.zeros((1,1), dtype = torch.long)
decode(m.generate( x1 , n_pred = 1000)[0].tolist())
k = datetime.datetime.now() - start_time

print(f'Time taken : {k}')

' talurstoff 3 abe . coralid l jan orok s r an Whid pleis , Meis l as blllyd o s wasig isitby ointh Berarmecio iou ran amayinarsipt " wigured in thinngeparoy Pl s Jutsced trsply heeded ftonithe isch Chittorfocecedertorthon ia cl sahed mpathed ppl d , lise sbuly cuanun sthen t miper . to , les B . rinoung ancelehac s n pesing fed P eas , wn serar . iff Ralem jedins min bes thes , ak tin , Uns Borigle lltandeoveblfonintahart : Is peapas marediceurinugede , ctnano , tins prn cag s , Sce s sty aratrmeulol , " Nathevene rylls whts Founid iccas far sshimor to pral Thed If s m vedelere tore Ha whyerviovickaiurs thit mintially tour gun Bangld s fialed f s Stll milatersthenonune amied hedan Therinedatra Pondipar otendinthenug inger Ind by s Th frereafforerecon ne , o y y Hum " fooredal be n . bes an balanth cand wo th , wioaralicangecriomet En the fatil fe t re iltrllend nd d . Thastir oy bubredey wre , pectoll s d obat theles ro azo opth toorbl aroumipicerenaceson . Fre dec anthariampsoldengho

Time taken : 0:00:00.195715
