# **Bigram Model**

Here we generate character sequences that mimic Shakespearean language. The model is a character-level bigram: it predicts the next character from the current character alone. By training on Shakespeare's works, it learns the probabilistic relationships between adjacent characters, which lets it produce text with some Shakespeare-like surface patterns. Because it conditions on a single character, the output is not expected to be coherent.
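The idea of learning probabilistic relationships between adjacent symbols can be illustrated by plain counting, before any neural network is involved. The sketch below is an illustration only, not the method used in this notebook: `bigram_probs` and the sample string are hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict

def bigram_probs(sample):
    """Estimate P(next_char | current_char) from adjacent character pairs."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sample, sample[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, following in counts.items():
        total = sum(following.values())
        # Normalise the counts for each current character into probabilities
        probs[cur] = {nxt: c / total for nxt, c in following.items()}
    return probs

probs = bigram_probs("speak, speak.")
print(probs["k"])  # 'k' is followed once by ',' and once by '.'
print(probs["s"])  # 's' is always followed by 'p' in this sample
```

The neural model in this notebook learns a comparable table of next-character scores by gradient descent rather than by counting.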

In [1]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print(f'Length of the dataset : {len(text)}')

Length of the dataset : 1115394


In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f'Characters : {"".join(chars)}\nVocab size : {vocab_size}')

Characters : 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size : 65


In [5]:
stoi = {ch : i for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}
encode = lambda s : [stoi[c] for c in s]
decode = lambda l : "".join([itos[i] for i in l])

In [6]:
encode("Hariprashaad")

[20, 39, 56, 47, 54, 56, 39, 57, 46, 39, 39, 42]

In [7]:
decode([20, 39, 56, 47, 54, 56, 39, 57, 46, 39, 39, 42])

'Hariprashaad'
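A property worth checking is that `decode` inverts `encode`. The sketch below rebuilds the same kind of mapping on a short hypothetical string (the `toy_` names are assumptions, used so the notebook's own variables are not overwritten) and also shows that a character outside the vocabulary raises a `KeyError`.

```python
toy_text = "hear me speak"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

# Encoding then decoding should return the original string
round_trip = toy_decode(toy_encode("speak"))
print(round_trip)

# 'Z' never appears in toy_text, so it has no id
try:
    toy_encode("Z")
    unknown_rejected = False
except KeyError:
    unknown_rejected = True
print(unknown_rejected)
```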

## **Modelling**

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [9]:
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


In [10]:
data[:100]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])

### **(i) Train-validation split**

In [11]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [12]:
block_size = 8
train_data[:block_size +1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [13]:
x = train_data[:block_size]
y = train_data[1 : block_size +1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


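Hence a single chunk of block_size + 1 characters yields block_size training examples, with context lengths from 1 to block_size. A model trained this way can predict from any context length in that range, although the bigram model below uses only the last character of the context.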

In [14]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,))
  x = torch.stack([data[i: i+block_size] for i in ix])
  y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
  return x,y

In [15]:
xb, yb = get_batch('train')
print('inputs : ', xb.shape)
print('targets : ', yb.shape)

print('------')
for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, : t+1]
    target = yb[b, t]
    print(f"when input is {context.tolist()} the target: {itos[target.item()]}")

inputs :  torch.Size([4, 8])
targets :  torch.Size([4, 8])
------
when input is [24] the target: e
when input is [24, 43] the target: t
when input is [24, 43, 58] the target: '
when input is [24, 43, 58, 5] the target: s
when input is [24, 43, 58, 5, 57] the target:  
when input is [24, 43, 58, 5, 57, 1] the target: h
when input is [24, 43, 58, 5, 57, 1, 46] the target: e
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: a
when input is [44] the target: o
when input is [44, 53] the target: r
when input is [44, 53, 56] the target:  
when input is [44, 53, 56, 1] the target: t
when input is [44, 53, 56, 1, 58] the target: h
when input is [44, 53, 56, 1, 58, 46] the target: a
when input is [44, 53, 56, 1, 58, 46, 39] the target: t
when input is [44, 53, 56, 1, 58, 46, 39, 58] the target:  
when input is [52] the target: t
when input is [52, 58] the target:  
when input is [52, 58, 1] the target: t
when input is [52, 58, 1, 58] the target: h
when input is [52, 58, 1, 58, 46] the tar

### **(ii) Bigram modelling**

In [16]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets = None):
    logits = self.token_embedding_table(idx) #(B, T, C)

    if targets is None:
      loss = None
    else:
      # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,), so flatten the batch and time dimensions
      B, T, C = logits.shape
      logits = logits.view(B * T, C)
      targets = targets.view(B * T)

      loss = F.cross_entropy(logits, targets)
    return (loss, logits)

  def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of current context
    for _ in range(max_new_tokens):
      # get the prediction with the current idx
      loss, logits = self(idx)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax
      probs = F.softmax(logits, dim = -1)
      # generate the next index using multinomial
      idx_next = torch.multinomial(probs, num_samples = 1)
      # concat the idx_next with idx to generate new idx
      idx = torch.cat((idx, idx_next), dim = 1)
    return idx

In [17]:
n = BigramLanguageModel(vocab_size)
loss, logits = n(xb, yb) # xb and yb are both (4, 8)
print(logits.shape) # (32, 65): forward flattens logits to (B * T, C) when targets are given
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
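As a sanity check on the initial loss: a model that assigned equal probability to all 65 characters would have a cross-entropy of ln(65), about 4.17. The value printed above is higher, most likely because the randomly initialised embedding rows are not uniform, so some wrong characters start out with high scores. The snippet only does this arithmetic; `uniform_loss` and `vocab_size_assumed` are names introduced here.

```python
import math

vocab_size_assumed = 65  # vocabulary size reported earlier in the notebook
# Cross-entropy of a uniform prediction: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size_assumed)
print(f"{uniform_loss:.4f}")
```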


In [18]:
indices = n.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 100)
indices

tensor([[ 0, 31, 56, 12, 55, 28,  7, 29, 35, 49, 58, 36, 53, 24,  4, 48, 24, 16,
         22, 45, 27, 24, 34, 64,  5, 30, 21, 53, 16, 55, 20, 42, 46, 57, 34,  4,
         60, 24, 24, 62, 39, 58, 48, 57, 41, 25, 54, 61, 24, 17, 30, 31, 28, 63,
         39, 53,  8, 55, 44, 64, 57,  3, 37, 57,  3, 64, 18,  7, 61,  6, 11, 43,
         17, 49, 64, 62, 48, 45, 15, 23, 18, 15, 46, 57,  2, 47, 35, 35,  8, 27,
         40, 64, 16, 52, 62, 13,  1, 25, 57,  3,  9]])

In [19]:
print(decode(indices[0].tolist()))


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
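The sampling step inside `generate` (softmax over the last position's logits, then a draw from the resulting distribution) can be sketched in plain Python. This is an illustration under stated assumptions, not the PyTorch code path: `toy_softmax`, `toy_logits`, and the use of `random.choices` in place of `torch.multinomial` are choices made for this example.

```python
import math
import random

def toy_softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1337)
toy_logits = [2.0, 0.5, -1.0]  # hypothetical scores for a 3-token vocabulary
toy_probs = toy_softmax(toy_logits)
# Draw one index in proportion to its probability
next_index = random.choices(range(len(toy_probs)), weights=toy_probs, k=1)[0]
print(toy_probs, next_index)
```

With untrained weights the scores are essentially arbitrary, which is why the sample above looks like noise.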


### **(iii) Creating a PyTorch optimizer**

In [20]:
optimizer = torch.optim.AdamW(n.parameters(), lr = 1e-3)

In [21]:
batch_size = 32
for steps in range(10000):
  xb, yb = get_batch('train')

  loss, logits = n(xb, yb)

  optimizer.zero_grad(set_to_none = True)
  loss.backward()
  optimizer.step()

  if steps % 1000 == 0:
    print(f'{loss.item() : .4f}')

 4.7040
 3.7031
 3.1372
 2.7768
 2.5845
 2.5105
 2.5316
 2.5048
 2.4697
 2.4839


In [22]:
indices = n.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 100)
print(decode(indices[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y helti


# **Summary**

In [23]:
torch.manual_seed(1337)

<torch._C.Generator at 0x7a03ac7989b0>

In [40]:
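# Constants

batch_size = 32
block_size = 8
epochs = 10000        # number of training steps (one batch each), not full passes over the data
eval_interval = 1000  # estimate the loss every eval_interval steps
lr = 1e-3
eval_epochs = 400     # number of batches averaged per loss estimate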

In [25]:
def get_batch(split):
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,))
  x = torch.stack([data[i: i+block_size] for i in ix])
  y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
  return x,y

In [33]:
@torch.no_grad()
def estimate_loss(model):
  out = {}
  model.eval()

  for split in ['train', 'val']:
    losses = torch.zeros(eval_epochs)
    for k in range(eval_epochs):
      X, Y = get_batch(split)
      loss, logits = model(X, Y)
      losses[k] = loss.item()
    out[split] = losses.mean()
  model.train()
  return out

In [29]:
class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets = None):
    logits = self.token_embedding_table(idx)

    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape
      logits = logits.view(B * T, C)
      targets = targets.view(B * T)

      loss = F.cross_entropy(logits, targets)
    return (loss, logits)

  def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
      loss, logits = self(idx)
      logits = logits[:, -1, :]
      probs = F.softmax(logits, dim = -1)
      idx_next = torch.multinomial(probs, num_samples = 1)
      idx = torch.cat((idx, idx_next), dim = 1)
    return idx

In [30]:
xb, yb = get_batch('train')
print('inputs : ', xb.shape)
print('targets : ', yb.shape)

inputs :  torch.Size([32, 8])
targets :  torch.Size([32, 8])


In [41]:
model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr = lr)

for epoch in range(epochs + 1):

  if epoch % eval_interval == 0:
    losses = estimate_loss(model)
    print(f"epoch : {epoch:^5} | train_loss {losses['train'] : .4f} | val_loss {losses['val'] : .4f}")

  xb, yb = get_batch('train')

  loss, logits = model(xb, yb)

  optimizer.zero_grad(set_to_none = True)
  loss.backward()
  optimizer.step()

epoch :   0   | train_loss  4.7126 | val_loss  4.7048
epoch : 1000  | train_loss  3.7356 | val_loss  3.7344
epoch : 2000  | train_loss  3.1374 | val_loss  3.1436
epoch : 3000  | train_loss  2.8113 | val_loss  2.8177
epoch : 4000  | train_loss  2.6411 | val_loss  2.6528
epoch : 5000  | train_loss  2.5663 | val_loss  2.5732
epoch : 6000  | train_loss  2.5284 | val_loss  2.5361
epoch : 7000  | train_loss  2.5062 | val_loss  2.5111
epoch : 8000  | train_loss  2.4748 | val_loss  2.4952
epoch : 9000  | train_loss  2.4794 | val_loss  2.4922
epoch : 10000 | train_loss  2.4658 | val_loss  2.4875
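One way to read these numbers is to convert the cross-entropy (in nats) to perplexity, exp(loss), which can be interpreted as the effective number of equally likely characters the model is choosing between. The sketch uses the first and last validation losses from the log above; the variable names are introduced here.

```python
import math

initial_val_loss = 4.7048  # val_loss at epoch 0, from the log above
final_val_loss = 2.4875    # val_loss at epoch 10000, from the log above

# Perplexity is the exponential of the average cross-entropy
initial_ppl = math.exp(initial_val_loss)
final_ppl = math.exp(final_val_loss)
print(f"perplexity: {initial_ppl:.1f} -> {final_ppl:.1f}")
```

Training takes the model from over a hundred effective choices per character to roughly twelve, which is still far from fluent text.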


In [44]:
print(decode(model.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 100)[0].tolist()))


THAy wind s:
Oh, 'dos t forandowig airimy mewhed hth
TFo's lteache him, ukss y.

wa ge-'s aryo rer n


The above is a simple character-level bigram model that generates text using only the last character, so its output is not coherent. A natural next step is a transformer model that draws on the embeddings of multiple previous characters.
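The limitation is easy to see in miniature: a bigram model is a lookup table indexed by the current token, so any earlier context is ignored. The sketch below uses a hypothetical 3-token table (`toy_table` and `toy_bigram_logits` are names made up for the example) in place of the `nn.Embedding` weights.

```python
# Hypothetical table: row i holds the next-token scores when the current token is i
toy_table = [
    [0.1, 2.0, -1.0],
    [1.5, 0.0, 0.3],
    [-0.5, 0.2, 2.2],
]

def toy_bigram_logits(context):
    """Return next-token scores; only the last token of the context is used."""
    return toy_table[context[-1]]

long_context = toy_bigram_logits([0, 1, 2])
short_context = toy_bigram_logits([2])
# Both contexts end in token 2, so the predictions are identical
print(long_context == short_context)
```

A model that conditions on more than the last token would be able to give different predictions for these two contexts.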