## Basic LSTM text generator in PyTorch from guided tutorial at:
https://machinelearningmastery.com/text-generation-with-lstm-in-pytorch/

Trained on: https://www.gutenberg.org/ebooks/11 with the Project Gutenberg intro and outro removed

### Load the data
plus get unique char set

In [23]:
import numpy as np

# load ASCII text and convert to lowercase
filename = "sonnets.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for c, i in char_to_int.items())

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  97920
Total Vocab:  39
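
To make the character mapping concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_chars`, `toy_char_to_int` and `toy_int_to_char` are illustrative stand-ins, not part of the notebook.

```python
# Toy stand-in for raw_text; the notebook reads the real text from a file.
toy_text = "hello world"

# Same idea as above: sorted unique characters, then two lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

# Encode the text to integers and decode it back again.
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

Encoding followed by decoding returns the original string, which is a quick sanity check that the two dictionaries are inverses of each other.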


### Window over the data
Split the data into X and Y, where X is an array of arrays of **seq_length** characters (a hyperparameter) and Y is the next character after each array.

In [2]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100

dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  97820
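
The same windowing can be sketched on a short string to show what each (input, target) pair looks like. `toy_text`, `toy_seq_length` and `pairs` are hypothetical names used only for this illustration.

```python
toy_text = "abcdefgh"
toy_seq_length = 3

pairs = []
for i in range(len(toy_text) - toy_seq_length):
    window = toy_text[i:i + toy_seq_length]   # the input characters
    target = toy_text[i + toy_seq_length]     # the single character that follows
    pairs.append((window, target))

for window, target in pairs:
    print(repr(window), "->", repr(target))
print("Total patterns:", len(pairs))
```

The number of pairs is the text length minus the window length, which matches the count above: 97920 characters with a window of 100 give 97820 patterns.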


### Transform data to tensors
and reshape X to be [number_of_sequences, length_of_sentence, num_of_features], where:  
number_of_sequences = number of samples generated (n_patterns)  
length_of_sentence, also known as **time steps** = seq_length  
num_of_features = number of input values per time step; here 1, the normalized character index  

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)
print(X.shape, y.shape)

torch.Size([97820, 100, 1]) torch.Size([97820])
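
A small sketch of the same reshape and scaling on made-up data may help when checking shapes and dtypes. The `toy_` values are assumptions chosen for illustration.

```python
import torch

toy_dataX = [[0, 1, 2], [1, 2, 3]]  # two windows of three encoded characters
toy_dataY = [3, 4]                  # the encoded next character for each window
toy_n_vocab = 5                     # hypothetical vocabulary size

# [samples, time steps, features] with a single feature per time step
toy_X = torch.tensor(toy_dataX, dtype=torch.float32).reshape(len(toy_dataX), 3, 1)
toy_X = toy_X / float(toy_n_vocab)  # scale the integer codes into [0, 1)
toy_y = torch.tensor(toy_dataY)     # integer class indices for the loss

print(toy_X.shape, toy_y.shape, toy_y.dtype)
```

The targets stay as integer class indices because `nn.CrossEntropyLoss` expects them in that form, while the inputs are floats because they are fed directly into the LSTM.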


### Define simple LSTM model

In [4]:
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x
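
A quick way to check the model is to push a random batch through it and look at the output shape. The sketch below repeats the architecture under the hypothetical name `CharModelSketch` and takes the vocabulary size as an argument so that it runs on its own; 39 is the vocabulary size printed earlier.

```python
import torch
import torch.nn as nn

class CharModelSketch(nn.Module):
    """Same layers as CharModel, with n_vocab passed in explicitly."""
    def __init__(self, n_vocab):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)

    def forward(self, x):
        x, _ = self.lstm(x)   # x: [batch, time steps, 256]
        x = x[:, -1, :]       # keep only the last time step: [batch, 256]
        return self.linear(self.dropout(x))

sketch = CharModelSketch(n_vocab=39)
dummy_batch = torch.rand(4, 100, 1)   # 4 random sequences of 100 time steps
logits = sketch(dummy_batch)
print(logits.shape)
```

Each row of `logits` holds one unnormalized score per character in the vocabulary.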

### Important: device setup for speed if GPU available

In [10]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

### Train model
* n_epochs
* batch_size

#### Batch data

In [13]:
n_epochs = 40
batch_size = 128
model = CharModel()

model.to(device)

X = X.to(device)
y = y.to(device)
 
optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)
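
Because `loss_fn` uses `reduction="sum"`, the number printed per epoch is a total over all training patterns rather than a per-character average. A rough sketch of how to put it on a per-character scale, using the pattern count, the vocabulary size and the epoch 0 loss printed in this notebook (the variable names are illustrative):

```python
import math

n_patterns_run = 97820       # "Total Patterns" printed earlier
n_vocab_run = 39             # "Total Vocab" printed earlier
epoch0_sum = 275209.8438     # epoch 0 cross-entropy from the training log

# Average cross-entropy per predicted character, in nats
per_char = epoch0_sum / n_patterns_run

# Cross-entropy of guessing uniformly over the vocabulary, for comparison
uniform = math.log(n_vocab_run)

print("per character: %.4f" % per_char)
print("uniform guess: %.4f" % uniform)
```

Dividing by the pattern count gives about 2.81 nats per character for epoch 0, compared with about 3.66 for a uniform guess over 39 characters.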

#### Train model

In [14]:
best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
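    # Evaluate on the training data (there is no separate validation split)
    model.eval()
    loss = 0
    with torch.inference_mode():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch)
            loss += loss_fn(y_pred, y_batch)
    if loss < best_loss:
        best_loss = loss
        # clone the weights: state_dict() returns references that later training steps overwrite
        best_model = {k: v.clone() for k, v in model.state_dict().items()}
    print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))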
 
torch.save([best_model, char_to_int], "single-char.pth")

Epoch 0: Cross-entropy: 275209.8438
Epoch 1: Cross-entropy: 260878.3594
Epoch 2: Cross-entropy: 250564.0625
Epoch 3: Cross-entropy: 246468.7969
Epoch 4: Cross-entropy: 244189.2656
Epoch 5: Cross-entropy: 240961.9844
Epoch 6: Cross-entropy: 237438.6094
Epoch 7: Cross-entropy: 236001.0156
Epoch 8: Cross-entropy: 232041.2969
Epoch 9: Cross-entropy: 229841.6250
Epoch 10: Cross-entropy: 227710.4531
Epoch 11: Cross-entropy: 224227.5312
Epoch 12: Cross-entropy: 221384.6250
Epoch 13: Cross-entropy: 218321.7656
Epoch 14: Cross-entropy: 215453.1250
Epoch 15: Cross-entropy: 212527.3750
Epoch 16: Cross-entropy: 209258.3125
Epoch 17: Cross-entropy: 206228.2500
Epoch 18: Cross-entropy: 202893.5312
Epoch 19: Cross-entropy: 200059.5312
Epoch 20: Cross-entropy: 198487.7969
Epoch 21: Cross-entropy: 194645.6094
Epoch 22: Cross-entropy: 191912.0156
Epoch 23: Cross-entropy: 188822.9844
Epoch 24: Cross-entropy: 186370.4844
Epoch 25: Cross-entropy: 183326.8906
Epoch 26: Cross-entropy: 180954.7812
Epoch 27: C

NameError: name 'char_to_dict' is not defined

In [15]:
torch.save([best_model, char_to_int], "single-char.pth")

### Generating text
The model predicts the next character from an input of **seq_length** characters, so to generate new text we have to supply input in that format, e.g. from other sonnets or (to test) from our data source / the same author.

The model's output is the predicted next character. We append it to the end of the input and drop the input's first element, which keeps the input shape unchanged, and we also add the predicted character to the final output, which is kept separately. We then run the model again, repeating for a specified number of steps.
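
The sliding-window loop described here can be sketched without a trained model. `fake_predict` is a hypothetical stand-in for the network: it just returns the last value plus one, modulo 5.

```python
def fake_predict(window):
    # Hypothetical stand-in for the model: (last value + 1) mod 5.
    return (window[-1] + 1) % 5

window = [0, 1, 2]   # encoded prompt; its length plays the role of seq_length
generated = []       # predicted indices, kept separately from the window

for _ in range(4):
    index = fake_predict(window)
    generated.append(index)
    # Drop the oldest entry and append the newest, so the length never changes.
    window = window[1:] + [index]

print(generated)
print(window)
```

The window always keeps the same length, so every call sees an input of the shape the model was trained on.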

Important note: an LSTM layer does not return just the final step. `nn.LSTM` returns the output at every time step, plus the final hidden and cell states. From that output we have to decide whether to use only the last time step (as `CharModel.forward` does) or to combine the earlier time steps in some other way.
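
To see what the layer actually returns, here is a minimal sketch with a small, randomly initialized `nn.LSTM`. The sizes are arbitrary assumptions chosen to keep the shapes easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=1, batch_first=True)
x = torch.rand(2, 5, 1)   # [batch=2, time steps=5, features=1]

output, (h_n, c_n) = lstm(x)
print(output.shape)   # hidden vector for every time step
print(h_n.shape)      # final hidden state, one per layer
print(c_n.shape)      # final cell state, one per layer

# For a unidirectional LSTM, the last time step of the output is the
# final hidden state of the last layer.
last_step = output[:, -1, :]
print(torch.allclose(last_step, h_n[-1]))
```

Slicing `output[:, -1, :]` is therefore one way of picking out the final hidden state of the top layer.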

##### Getting a random prompt from the data

In [16]:
seq_length = 100
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]

Alternatively, you could pass in a left-padded array of length seq_length to start without any data.
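
One possible way to build such a left-padded prompt is sketched below. The helper `left_pad_prompt` and the choice of a space as the padding character are assumptions, not something the tutorial prescribes; the padding character has to exist in `char_to_int`, and the model was never trained on padded inputs, so the quality of the first generated characters may suffer.

```python
def left_pad_prompt(text, seq_length, pad_char=" "):
    """Trim or left-pad text so that it is exactly seq_length characters long."""
    text = text[-seq_length:]                  # keep only the most recent characters
    return text.rjust(seq_length, pad_char)    # pad on the left if it is too short

short_prompt = left_pad_prompt("shall i", 10)
long_prompt = left_pad_prompt("abcdefghijkl", 5)
empty_prompt = left_pad_prompt("", 4)

print(repr(short_prompt))
print(repr(long_prompt))
print(repr(empty_prompt))
```

The result can be encoded with `char_to_int` in the same way as a prompt taken from the data.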

##### Transform the input into proper format

In [22]:
pattern = [char_to_int[c] for c in prompt]

##### Predict

In [18]:
output_length = 1000

In [24]:
model.eval()
print('Prompt: "%s"' % prompt)
with torch.no_grad():
    for i in range(output_length):
        # format input array of int into PyTorch tensor
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        x = torch.tensor(x, dtype=torch.float32).to(device)
        # generate logits as output from the model
        prediction = model(x)
        # convert logits into one character
        index = int(prediction.argmax())
        result = int_to_char[index]
        print(result, end="")
        # append the new character into the prompt for the next iteration
        pattern.append(index)
        pattern = pattern[1:]
print()
print("Done.")

Prompt: "ll grind
 on newer proof, to try an older friend,
 a god in love, to whom i am confin'd.
   then giv"
e, not thoue, thy love' thy love shal wort.

 lxxiii

 then mo the sorengs so the wirld toeel oowe
 and tour the baau and thene ooo derprede,
 and toen the tare to thee in thei whll shee,
 and there ooo mo bear eear hev soen lo toene,
 and thene ae thine, and the tire that toee  and to the romi of toet wher iedves siahe,
 the caiest derter dan hacr hotm tiee,    thes thou dester te the tire that iove thee soote,
   then thou aettres thite whth thee, and then to toowe.

 xxii

 nh the world sfat soici siee hor lo thee,
 and tour in then if touth io touth to gesr,
 and there ooo mo horerg certer derterss,
 which that this soogse that thich i foor thee,
 and theu foo mo and thin iet soeel coowent,
 woth she loe to thire whuh thut lose so lroe,
 thou has she lerercics of the wirl of lroes,
 whine ie thou singss thatl to thls drty'st,
 and thereiore shat thich whth soulh to thee   
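
The generation loop picks the single most likely character with `argmax`, which tends to fall into repeated phrases like the ones above. A common alternative, which is not what this notebook or the tutorial does, is to sample from the softmax distribution with a temperature. The helper `sample_index` and the toy logits below are hypothetical.

```python
import torch

def sample_index(logits, temperature=1.0):
    """Sample one class index from logits instead of taking the argmax."""
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    probs = torch.softmax(logits.flatten() / temperature, dim=0)
    return int(torch.multinomial(probs, num_samples=1))

torch.manual_seed(0)
toy_logits = torch.tensor([[2.0, 1.0, 0.1]])   # shape [1, n_vocab], like the model output
picks = [sample_index(toy_logits, temperature=0.8) for _ in range(10)]
print(picks)
```

In the loop, `index = sample_index(prediction)` would take the place of `index = int(prediction.argmax())` if sampling is preferred.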

##### Extra: loading a model

In [None]:
# best_model, char_to_int = torch.load("single-char.pth")
# n_vocab = len(char_to_int)
# int_to_char = dict((i, c) for c, i in char_to_int.items())
 
# # reload the model
# class CharModel(nn.Module):
#     def __init__(self):
#         super().__init__()
#         self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
#         self.dropout = nn.Dropout(0.2)
#         self.linear = nn.Linear(256, n_vocab)
#     def forward(self, x):
#         x, _ = self.lstm(x)
#         # take only the last output
#         x = x[:, -1, :]
#         # produce output
#         x = self.linear(self.dropout(x))
#         return x
# model = CharModel()
# model.load_state_dict(best_model)

### Bigger model, more interesting data
more layers (1 -> 2)  
bigger hidden state (256 -> 500)  
plus a dropout between the lstm layers  
more epochs
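
To get a feel for how much bigger the second model is, the trainable parameters of both configurations can be counted. This is a sketch: the helper `count_params` is hypothetical, and the vocabulary sizes 39 and 46 are the ones printed for the two datasets in this notebook.

```python
import torch.nn as nn

def count_params(hidden_size, num_layers, n_vocab):
    """Count trainable parameters of an LSTM plus a linear output layer."""
    lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                   num_layers=num_layers, batch_first=True)
    linear = nn.Linear(hidden_size, n_vocab)
    return sum(p.numel() for module in (lstm, linear) for p in module.parameters())

small = count_params(hidden_size=256, num_layers=1, n_vocab=39)
big = count_params(hidden_size=500, num_layers=2, n_vocab=46)
print("small model:", small)
print("big model:  ", big)
```

Under these assumptions the counts come out to 275,239 and 3,033,046 parameters. Dropout layers add no parameters, so they are left out of the count.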


In [25]:
import numpy as np

# load ASCII text and convert to lowercase
filename = "doom_lyrics_10.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for c, i in char_to_int.items())

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  13326
Total Vocab:  46


In [26]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100

dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  13226


In [27]:
import torch
import torch.nn as nn
import torch.optim as optim

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)
print(X.shape, y.shape)

torch.Size([13226, 100, 1]) torch.Size([13226])


#### Model def

In [28]:
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=500, num_layers=2, batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(500, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x

#### Hyperparams and batching

In [29]:
n_epochs = 50
batch_size = 128
model = CharModel()

model.to(device)

X = X.to(device)
y = y.to(device)
 
optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)

#### Training

In [30]:
best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
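    # Evaluate on the training data (there is no separate validation split)
    model.eval()
    loss = 0
    with torch.inference_mode():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch)
            loss += loss_fn(y_pred, y_batch)
    if loss < best_loss:
        best_loss = loss
        # clone the weights: state_dict() returns references that later training steps overwrite
        best_model = {k: v.clone() for k, v in model.state_dict().items()}
    print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))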
 
torch.save([best_model, char_to_int], "doom_bigger.pth")

Epoch 0: Cross-entropy: 40262.7773
Epoch 1: Cross-entropy: 39590.3281
Epoch 2: Cross-entropy: 38013.3828
Epoch 3: Cross-entropy: 36848.4453
Epoch 4: Cross-entropy: 36428.7383
Epoch 5: Cross-entropy: 35954.5273
Epoch 6: Cross-entropy: 36024.7188
Epoch 7: Cross-entropy: 34768.0039
Epoch 8: Cross-entropy: 34082.7227
Epoch 9: Cross-entropy: 33125.3086
Epoch 10: Cross-entropy: 32378.8516
Epoch 11: Cross-entropy: 31193.6641
Epoch 12: Cross-entropy: 29897.7891
Epoch 13: Cross-entropy: 27899.1758
Epoch 14: Cross-entropy: 25787.2090
Epoch 15: Cross-entropy: 23417.8770
Epoch 16: Cross-entropy: 20840.6992
Epoch 17: Cross-entropy: 18289.4531
Epoch 18: Cross-entropy: 16143.0479
Epoch 19: Cross-entropy: 13825.5479
Epoch 20: Cross-entropy: 11637.7578
Epoch 21: Cross-entropy: 9663.5566
Epoch 22: Cross-entropy: 8113.5322
Epoch 23: Cross-entropy: 6587.5669
Epoch 24: Cross-entropy: 5276.6514
Epoch 25: Cross-entropy: 4259.2539
Epoch 26: Cross-entropy: 3258.2354
Epoch 27: Cross-entropy: 2554.8892
Epoch 28:

#### Eval

In [31]:
seq_length = 100
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]

pattern = [char_to_int[c] for c in prompt]

In [32]:
output_length = 250

In [33]:
model.eval()
print('Prompt: "%s"' % prompt)
with torch.no_grad():
    for i in range(output_length):
        # format input array of int into PyTorch tensor
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        x = torch.tensor(x, dtype=torch.float32).to(device)
        # generate logits as output from the model
        prediction = model(x)
        # convert logits into one character
        index = int(prediction.argmax())
        result = int_to_char[index]
        print(result, end="")
        # append the new character into the prompt for the next iteration
        pattern.append(index)
        pattern = pattern[1:]
print()
print("Done.")

Prompt: "he smack just missed her
there go a list of politics like henry kissinger
99% of rap's just a friend"
ly listener
i'm like, "these dudes must have some screws loose to hate y'all"
or a couple of ounces short of deuce-deuce or 8 ball
y'all know it's time for the end when the day come
buy a album, get rudely insulted over fake drums
same cds you get fo
Done.
