# Text Generator based on next character prediction

Built with a simple neural network with two hidden layers, this text generator predicts the characters that follow a given input text.

The intention here is not to generate meaningful sentences; that would require far more compute. Instead, this app shows that even a vanilla neural network can capture the format of the English language and generate words that are (very close to) valid. Notice that the model uses capital letters (including the capital I), punctuation marks and full stops nearly correctly. The text is generated paragraph by paragraph because the model learnt this structure from the text corpus.

This model was trained on a simple 600 KB text corpus: 'Gulliver's Travels'.

Streamlit application: [Link](https://skynet-text-generator-ml.streamlit.app/)

In [None]:
import torch
import torch.nn.functional as F
from torch import nn
import pandas as pd
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from pprint import pprint

In [None]:
torch.__version__

'2.2.1+cu121'

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
device

device(type='cuda')

In [None]:
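# Open the file in read mode; the corpus contains non-ASCII characters
# (curly quotes, æ, œ), so the encoding is set explicitly
with open('gt.txt', 'r', encoding='utf-8') as file:
    # Read the entire content of the file
    thefile = file.read()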

In [None]:
# thefile = thefile.lower()

In [None]:
content = thefile[:-2000]
test = thefile[-2000:]

In [None]:
len(content), len(test)

(567139, 2000)

In [None]:
print(type(content))

<class 'str'>


In [None]:
# words = pd.read_csv('names-long.csv')["Name"]
# words = words.str.lower()
# words = words.str.strip()
# words = words.str.replace(" ", "")

# words = words[words.str.len() > 2]
# words = words[words.str.len() < 10]

# # Randomly shuffle the words
# words = words.sample(frac=1).reset_index(drop=True)
# words = words.tolist()

# # Remove words having non alphabets
# words = [word for word in words if word.isalpha()]
# words[:10]


In [None]:
# Create a dictionary that maps each unique character to an integer index.
# Index 0 is reserved for the padding character '@'.
stoi = {'@': 0}

# Sort the unique characters so the mapping is deterministic
for i, char in enumerate(sorted(set(content) - {'@'}), start=1):
    stoi[char] = i

# Print the dictionary
print(stoi)

{'@': 0, '\n': 1, ' ': 2, '!': 3, '(': 4, ')': 5, ',': 6, '-': 7, '.': 8, '0': 9, '1': 10, '2': 11, '3': 12, '4': 13, '5': 14, '6': 15, '7': 16, '8': 17, '9': 18, ':': 19, ';': 20, '?': 21, 'A': 22, 'B': 23, 'C': 24, 'D': 25, 'E': 26, 'F': 27, 'G': 28, 'H': 29, 'I': 30, 'J': 31, 'K': 32, 'L': 33, 'M': 34, 'N': 35, 'O': 36, 'P': 37, 'Q': 38, 'R': 39, 'S': 40, 'T': 41, 'U': 42, 'V': 43, 'W': 44, 'X': 45, 'Y': 46, '[': 47, ']': 48, 'a': 49, 'b': 50, 'c': 51, 'd': 52, 'e': 53, 'f': 54, 'g': 55, 'h': 56, 'i': 57, 'j': 58, 'k': 59, 'l': 60, 'm': 61, 'n': 62, 'o': 63, 'p': 64, 'q': 65, 'r': 66, 's': 67, 't': 68, 'u': 69, 'v': 70, 'w': 71, 'x': 72, 'y': 73, 'z': 74, 'æ': 75, 'œ': 76, '–': 77, '—': 78, '‘': 79, '’': 80, '“': 81, '”': 82}


In [None]:
itos = {value: key for key, value in stoi.items()}

# Print the interchanged dictionary
print(itos)

{0: '@', 1: '\n', 2: ' ', 3: '!', 4: '(', 5: ')', 6: ',', 7: '-', 8: '.', 9: '0', 10: '1', 11: '2', 12: '3', 13: '4', 14: '5', 15: '6', 16: '7', 17: '8', 18: '9', 19: ':', 20: ';', 21: '?', 22: 'A', 23: 'B', 24: 'C', 25: 'D', 26: 'E', 27: 'F', 28: 'G', 29: 'H', 30: 'I', 31: 'J', 32: 'K', 33: 'L', 34: 'M', 35: 'N', 36: 'O', 37: 'P', 38: 'Q', 39: 'R', 40: 'S', 41: 'T', 42: 'U', 43: 'V', 44: 'W', 45: 'X', 46: 'Y', 47: '[', 48: ']', 49: 'a', 50: 'b', 51: 'c', 52: 'd', 53: 'e', 54: 'f', 55: 'g', 56: 'h', 57: 'i', 58: 'j', 59: 'k', 60: 'l', 61: 'm', 62: 'n', 63: 'o', 64: 'p', 65: 'q', 66: 'r', 67: 's', 68: 't', 69: 'u', 70: 'v', 71: 'w', 72: 'x', 73: 'y', 74: 'z', 75: 'æ', 76: 'œ', 77: '–', 78: '—', 79: '‘', 80: '’', 81: '“', 82: '”'}
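
The two dictionaries act as an encoder and a decoder for text. A minimal sketch of the round trip, using a tiny stand-in string instead of the full corpus; the helper names `encode` and `decode` are hypothetical and do not appear in the notebook, which inlines these lookups:

```python
# Tiny stand-in corpus; the notebook builds the same mapping from `content`
sample = "Gulliver's Travels"

# Index 0 is reserved for the padding character '@'
stoi = {'@': 0}
for i, ch in enumerate(sorted(set(sample) - {'@'}), start=1):
    stoi[ch] = i
itos = {i: ch for ch, i in stoi.items()}

def encode(text, stoi):
    """Turn a string into a list of integer indices (hypothetical helper)."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Turn a list of integer indices back into a string (hypothetical helper)."""
    return "".join(itos[i] for i in ids)

ids = encode("Travels", stoi)
roundtrip = decode(ids, itos)
print(ids, roundtrip)
```

Any character missing from `stoi` raises a `KeyError` in `encode`, so seed text should only use characters that occur in the corpus.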


In [None]:
block_size = 120 # context length: how many characters do we take to predict the next one?
X, Y = [], []
for i in range(len(content)-block_size-2):
  X.append([stoi[x] for x in content[i:i+block_size]])
  Y.append(stoi[content[i+block_size]])

# Move data to GPU

X = torch.tensor(X).to(device)
Y = torch.tensor(Y).to(device)

In [None]:
X

tensor([[37, 22, 39,  ..., 63,  2, 68],
        [22, 39, 41,  ...,  2, 68, 66],
        [39, 41,  2,  ..., 68, 66, 49],
        ...,
        [69, 53, 67,  ..., 53,  6,  2],
        [53, 67, 68,  ...,  6,  2, 60],
        [67, 68, 57,  ...,  2, 60, 49]], device='cuda:0')

In [None]:
Y

tensor([66, 49, 70,  ..., 60, 49, 70], device='cuda:0')

In [None]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([567017, 120]), torch.int64, torch.Size([567017]), torch.int64)
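
The sliding-window construction is easier to see on a short string. A minimal sketch with a toy `text` and a context length of 3 (both chosen only for illustration); each context of `block_size` characters is paired with the single character that follows it:

```python
text = "hello world"
block_size = 3  # toy context length; the notebook uses 120

# One (context, target) pair per starting position
pairs = [
    (text[i:i + block_size], text[i + block_size])
    for i in range(len(text) - block_size)
]

for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A string of length `n` yields at most `n - block_size` pairs. The notebook's loop stops two positions earlier (`len(content) - block_size - 2`), which is why `X` has 567017 rows for a corpus of 567139 characters.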

In [None]:
# Embedding layer for the context

emb_dim = 10
emb = torch.nn.Embedding(len(stoi), emb_dim)


In [None]:
emb.weight

Parameter containing:
tensor([[ 6.8964e-01,  1.1622e+00,  8.3505e-02, -2.3599e-01,  6.2329e-01,
          6.1196e-01,  1.3798e+00, -5.3010e-01, -7.0471e-01, -5.8698e-01],
        [ 2.0053e-01,  5.4518e-01, -6.2755e-01,  2.3470e+00, -8.9879e-01,
          2.0934e-01, -1.5553e+00,  1.1621e+00, -7.3243e-01,  4.8444e-01],
        [-7.9547e-01,  9.3606e-01, -9.9410e-01,  2.5048e-01,  4.5808e-01,
          1.2049e+00, -1.5866e-01, -7.2878e-01,  2.1527e+00,  9.0378e-01],
        [-1.4879e+00,  1.1102e-02, -1.2337e-01,  4.1331e-01,  1.8910e-01,
         -5.3438e-01,  6.7817e-01,  5.4391e-01,  9.7073e-01,  1.9197e+00],
        [ 1.5457e-01, -1.7849e+00, -4.6840e-01,  1.1649e+00, -8.9022e-01,
          1.2454e+00, -5.7183e-01, -5.5144e-01,  1.3761e+00, -2.2452e-01],
        [-4.8251e-01, -1.0155e+00,  3.7058e-01,  1.1520e+00,  5.4755e-01,
          1.0981e+00, -9.6724e-02, -6.4712e-02, -2.4221e+00,  2.0666e-02],
        [ 4.9746e-01, -1.2004e+00,  8.3133e-01, -6.0261e-01,  4.1018e-01,
          

In [None]:
emb.weight.shape

torch.Size([83, 10])
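
A minimal sketch of what the embedding lookup does to a batch of contexts, assuming the same sizes as above (83 characters, 10 dimensions, context length 120). The random `batch` is a stand-in for a slice of `X`:

```python
import torch
from torch import nn

vocab_size, emb_dim, block_size = 83, 10, 120
emb = nn.Embedding(vocab_size, emb_dim)

# Stand-in for X[:4]: four contexts of 120 character indices each
batch = torch.randint(0, vocab_size, (4, block_size))

vectors = emb(batch)                        # one 10-dim vector per character
flat = vectors.view(vectors.shape[0], -1)   # concatenate the vectors of each context

print(vectors.shape, flat.shape)
```

The flattened width, `block_size * emb_dim = 1200`, is the input size the first linear layer has to accept.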

In [None]:
# # Function to visualize the embedding in 2d space if 2 dimensions are used

# def plot_emb(emb, itos, ax=None):
#     if ax is None:
#         fig, ax = plt.subplots()
#     for i in range(len(itos)):
#         x, y = emb.weight[i].detach().cpu().numpy()
#         ax.scatter(x, y, color='k')
#         ax.text(x + 0.05, y + 0.05, itos[i])
#     return ax

# plot_emb(emb, itos)

In [None]:
class NextChar(nn.Module):
  def __init__(self, block_size, vocab_size, emb_dim, hidden_size1, hidden_size2):
    super().__init__()
    self.emb = nn.Embedding(vocab_size, emb_dim)
    self.lin1 = nn.Linear(block_size * emb_dim, hidden_size1)
    self.lin2 = nn.Linear(hidden_size1, hidden_size2)
    self.lin3 = nn.Linear(hidden_size2, vocab_size)

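  def forward(self, x):
    x = self.emb(x)                  # (batch, block_size, emb_dim)
    x = x.view(x.shape[0], -1)       # flatten to (batch, block_size * emb_dim)
    x = torch.sin(self.lin1(x)) # Activation function : change this
    x = torch.sin(self.lin2(x))      # second hidden layer
    return self.lin3(x)              # logits over the vocabulary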

In [None]:
# Generate text from the untrained model

model = NextChar(block_size, len(stoi), emb_dim, 500, 300).to(device)
model = torch.compile(model)

no_of_chars = 200
torch.manual_seed(200)  # seed the global RNG that Categorical.sample() draws from
def generate_name(model, inp, itos, stoi, block_size, max_len=no_of_chars):

    # Keep only the last block_size characters of the seed text and left-pad
    # with index 0 ('@') so the most recent characters sit at the end of the context
    inp = inp[-block_size:]
    # inp = inp.lower()
    context = [0] * (block_size - len(inp)) + [stoi[ch] for ch in inp]

    name = ''
    with torch.no_grad():
        for i in range(max_len):
            x = torch.tensor(context).view(1, -1).to(device)
            y_pred = model(x)
            ix = torch.distributions.categorical.Categorical(logits=y_pred).sample().item()
            # Skip any sampled index that has no character in the vocabulary
            if ix in itos:
                ch = itos[ix]
                # if ch == '.':
                #     break
                name += ch
                context = context[1:] + [ix]
    return name

print(generate_name(model, "@", itos, stoi, block_size, no_of_chars))

DJ,PdkG5:(l–7!JgkhaP9mxSK‘et(WuRrz;2tPd6“LCSHM


In [None]:
for param_name, param in model.named_parameters():
    print(param_name, param.shape)

_orig_mod.emb.weight torch.Size([83, 10])
_orig_mod.lin1.weight torch.Size([500, 1200])
_orig_mod.lin1.bias torch.Size([500])
_orig_mod.lin2.weight torch.Size([300, 500])
_orig_mod.lin2.bias torch.Size([300])
_orig_mod.lin3.weight torch.Size([83, 300])
_orig_mod.lin3.bias torch.Size([83])
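
The shapes listed above determine the size of the model. A short worked count, using only the sizes printed in this notebook (the dictionary keys are just labels for the layers):

```python
vocab_size, emb_dim, block_size = 83, 10, 120
hidden_size1, hidden_size2 = 500, 300

# weights + biases for each layer
counts = {
    "emb":  vocab_size * emb_dim,
    "lin1": block_size * emb_dim * hidden_size1 + hidden_size1,
    "lin2": hidden_size1 * hidden_size2 + hidden_size2,
    "lin3": hidden_size2 * vocab_size + vocab_size,
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", total)
```

Almost all of the parameters sit in `lin1`, because its input width grows with both the context length and the embedding size.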


In [None]:
# Train the model
import time

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(model.parameters(), lr=0.01)
# Mini-batch training
batch_size = 4096
print_every = 10
elapsed_time = []  # wall-clock seconds per epoch
for epoch in range(200):
    start_time = time.time()
    for i in range(0, X.shape[0], batch_size):
        x = X[i:i+batch_size]
        y = Y[i:i+batch_size]
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    end_time = time.time()
    elapsed_time.append(end_time - start_time)
    if epoch % print_every == 0:
        # Reports the loss of the last mini-batch of the epoch, not the epoch average
        print(epoch, loss.item())


0 2.2945427894592285
10 0.8446438312530518
20 0.7583667635917664
30 0.7739819288253784
40 0.7692346572875977
50 0.6106970310211182
60 0.5997738242149353
70 0.5012879967689514
80 0.493354469537735
90 0.4942704439163208
100 0.4229840934276581
110 0.3926779627799988
120 0.3828011155128479
130 0.39322641491889954
140 0.4038969576358795
150 0.3873109221458435
160 0.35438305139541626
170 0.37257158756256104
180 0.38877421617507935
190 0.3960263431072235
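
One way to read these numbers is to convert the cross-entropy loss (in nats) into a perplexity, `exp(loss)`, which is roughly the number of characters the model is choosing between at each step. A minimal sketch using two values copied from the log above; the uniform baseline assumes a guess spread evenly over the 83 characters of the vocabulary:

```python
import math

# Values copied from the training log (loss of the last mini-batch, in nats)
first_loss = 2.2945427894592285   # epoch 0
last_loss = 0.3960263431072235    # epoch 190

vocab_size = 83
uniform_loss = math.log(vocab_size)   # loss of a uniform guess over the vocabulary

for label, loss in [("uniform guess", uniform_loss),
                    ("epoch 0", first_loss),
                    ("epoch 190", last_loss)]:
    print(f"{label:>13}: loss {loss:.3f} nats, perplexity {math.exp(loss):.2f}")
```

These are training-batch losses, so they say how well the model fits the corpus it has seen, not how well it predicts unseen text.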


In [None]:
# # Visualize the embedding

# plot_emb(model.emb, itos)

In [None]:
# Generate text from the trained model
print(generate_name(model, "I love travelling around the ", itos, stoi, block_size, 1000))

 the saw the pars parion of the folt not have the deat thes ballon umarased, and a c fundupuintowayed the sevrimey the wing my severaltion; these sriegs.

I spok any of malien, and his eiriotd vawn napw of plangdent as I  clay dory cof on seaml fasured ived I ray wmyared, hage “heittabre fortps saaghte the severs, whede unhiph vh
ses vistat instred mane besce to delsw, houghfc sewere, and whrf in knekn, amonges is tail or my houadly on ro tot excomicise. I”usagale wisels indevel hirif my mastitale of qncear of the moring lagost for well cour mompliethitten tare finhl me weand de, and my habreavous the homeblratenn, and in the every cane. In of it to leasy Yahe ink cirtled with bimen alay dut of the phandever not do which I had have made domith of the natiel yot; bown iccourh I her somalh; treards, which in to the co take of them they to to coal him the comp mistive a oln fourne ut my moint on gost of a do nuct: “l edeicg, they hams; satiled wont teavinedy him rovent lay ragly and aming

In [None]:
torch.save(model.state_dict(),"gt_eng_model_upper_two_hid_layer_emb10.pth")
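
The last 2000 characters of the corpus were set aside as `test` but are never scored in this notebook. A minimal sketch of how they could be scored, under stated assumptions: the function name `heldout_loss` is hypothetical, characters missing from the vocabulary are mapped to the padding index 0, and a tiny stand-in model and text are used so the block runs on its own.

```python
import torch
import torch.nn.functional as F
from torch import nn

def heldout_loss(model, text, stoi, block_size, device="cpu"):
    """Mean next-character cross-entropy (in nats) over every window of `text`."""
    # Assumption: characters missing from the vocabulary map to padding index 0
    ids = [stoi.get(ch, 0) for ch in text]
    n = len(ids) - block_size
    x = torch.tensor([ids[i:i + block_size] for i in range(n)], device=device)
    y = torch.tensor([ids[i + block_size] for i in range(n)], device=device)
    model.eval()
    with torch.no_grad():
        logits = model(x)
    return F.cross_entropy(logits, y).item()

# Stand-in text, vocabulary and untrained model, for illustration only
toy_text = "the quick brown fox jumps over the lazy dog. " * 4
toy_stoi = {"@": 0, **{ch: i for i, ch in enumerate(sorted(set(toy_text)), start=1)}}
toy_block = 8
toy_model = nn.Sequential(
    nn.Embedding(len(toy_stoi), 4),
    nn.Flatten(),                               # (batch, 8, 4) -> (batch, 32)
    nn.Linear(toy_block * 4, len(toy_stoi)),
)

loss = heldout_loss(toy_model, toy_text, toy_stoi, toy_block)
print(loss)
```

In this notebook the equivalent call would be `heldout_loss(model, test, stoi, block_size, device)`; a held-out loss well above the training loss would suggest the model is memorising the corpus.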

### Tuning knobs

1. Embedding size (`emb_dim`)
2. MLP size (widths of the two hidden layers)
3. Context length (`block_size`)

### Streamlit Application
Explore how the generated text changes with embedding size, block size and seed text in the Streamlit application:
[Link](https://skynet-text-generator-ml.streamlit.app/)