Name: Mohammad Hossein Sameti

---



Student ID: 401204932

In this exercise, you should develop a character-level RNN language model.

You are free to choose the architecture, but you must use GRUs and not LSTMs. A linear embedding layer (hidden size 64), a 2-layer GRU (hidden size 128, dropout 0.1), and a linear classifier head is an example architecture.
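
The example architecture can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: the class name `CharGRU` and its constructor arguments are hypothetical, the embedding has 257 rows because index 0 is reserved for padding, and the head has 256 outputs as the task statement asks.

```python
import torch
import torch.nn as nn

class CharGRU(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size=257, emb_dim=64, hidden_dim=128, num_classes=256):
        super().__init__()
        # Index 0 is the padding token, indices 1..256 are ASCII codes shifted by one.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Dropout is applied between the two GRU layers.
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=2, dropout=0.1, batch_first=True)
        # One logit per ASCII code; no logit for padding.
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens, hidden=None):
        out, hidden = self.gru(self.embedding(tokens), hidden)
        return self.head(out), hidden

model = CharGRU()
tokens = torch.randint(1, 257, (3, 10))  # batch of 3 sequences, 10 characters each
logits, hidden = model(tokens)
print(logits.shape, hidden.shape)
```

The logits have shape `(batch, time, 256)` and the hidden state has shape `(num_layers, batch, 128)`, so the hidden state can be passed back in to continue a sequence.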

You should generate some example outputs using beam search.
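
Beam search keeps the `k` most probable partial sequences at every step instead of only the single best one. The sketch below shows the idea on a toy model in plain Python. The transition table `TABLE` and the helper `next_log_probs` are hypothetical stand-ins for a trained network.

```python
import heapq
import math

# Hypothetical toy "model": the next-character distribution depends only on the last character.
TABLE = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.1, "c": 0.4},
    "c": {"a": 0.9, "b": 0.05, "c": 0.05},
}

def next_log_probs(prefix):
    """Return log-probabilities of each possible next character."""
    return {ch: math.log(p) for ch, p in TABLE[prefix[-1]].items()}

def beam_search(start, length, k):
    """Extend `start` to `length` characters, keeping the k best prefixes per step."""
    beam = [(0.0, start)]  # (accumulated log-probability, prefix)
    while len(beam[0][1]) < length:
        candidates = [
            (score + lp, prefix + ch)
            for score, prefix in beam
            for ch, lp in next_log_probs(prefix).items()
        ]
        beam = heapq.nlargest(k, candidates)
    return beam

results = beam_search("a", length=4, k=2)
for score, text in results:
    print(f"{text}  p={math.exp(score):.3f}")
```

Scores are summed in log space so that long sequences do not underflow.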

Some parts of the code have been done for you. You need to implement the parts that raise `NotImplementedError`.

The index zero has been reserved for the padding token/character. By subtracting one from the token indices, the indices will become ASCII indices. (And the padding index will become `-1`.)

The model's classification head should directly predict ASCII characters (256 possibilities). It should not predict any special tokens, such as padding, start or end.
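
The index convention can be checked with a small round trip. This is a minimal sketch in plain Python; the helpers `encode`, `decode`, and `pad_batch` are hypothetical names and not part of the notebook.

```python
PAD = 0  # index reserved for the padding token

def encode(text):
    """Map ASCII characters to token indices (ASCII code + 1)."""
    return [b + 1 for b in text.encode("ascii", errors="ignore")]

def decode(tokens):
    """Drop padding, then subtract one to recover the ASCII codes."""
    return "".join(chr(t - 1) for t in tokens if t != PAD)

def pad_batch(seqs):
    """Right-pad every sequence with PAD up to the longest length."""
    width = max(len(s) for s in seqs)
    return [s + [PAD] * (width - len(s)) for s in seqs]

batch = pad_batch([encode("Harry"), encode("Hi")])
print(batch)
print([decode(row) for row in batch])
```

Because the classifier predicts ASCII codes directly, a target token index `t` corresponds to class `t - 1`, and padded positions (class `-1`) have to be excluded from the loss.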

# Bootstrap

# Install

In [1]:
! pip install -U torch datasets pyperclip icecream numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting pyperclip
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting icecream
  Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Collecting numpy
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

# Download the Data

In [2]:
!wget https://files.lilf.ir/Black%20Luminary.txt

--2023-04-27 17:50:35--  https://files.lilf.ir/Black%20Luminary.txt
Resolving files.lilf.ir (files.lilf.ir)... 82.102.11.148
Connecting to files.lilf.ir (files.lilf.ir)|82.102.11.148|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3148450 (3.0M) [text/plain]
Saving to: ‘Black Luminary.txt’


2023-04-27 17:50:38 (1.93 MB/s) - ‘Black Luminary.txt’ saved [3148450/3148450]



In [3]:
! ls -lh

total 3.1M
-rw-r--r-- 1 root root 3.1M Oct 14  2021 'Black Luminary.txt'
drwxr-xr-x 1 root root 4.0K Apr 26 17:36  sample_data


In [4]:
! realpath *.txt

/content/Black Luminary.txt


# User Config

In [5]:
data_paths = [
    '/content/Black Luminary.txt',
    ]

## imports

In [6]:
import pyperclip

In [7]:
from icecream import ic

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [9]:
device = torch.device("cpu")
#: We will set device again in the training loop.

In [10]:
import datasets as D

In [11]:
import numpy
np = numpy

import statistics

# Utils

In [12]:
class NumpyPrintOptions:
    def __init__(self, **kwargs):
        self.options = kwargs
        self.original_options = np.get_printoptions()

    def __enter__(self):
        np.set_printoptions(**self.options)

    def __exit__(self, exc_type, exc_value, traceback):
        np.set_printoptions(**self.original_options)

class NoTruncationNumpyPrintOptions(NumpyPrintOptions):
    def __init__(self):
        super().__init__(
            threshold=np.inf, 
            linewidth=200, 
            suppress=True, 
            precision=4
        )

In [13]:
import jax

def torch_shape_get(input):
    def h_shape_get(x):
        return x.dtype, x.shape

    #: jax.tree_map is deprecated; jax.tree_util.tree_map is the stable spelling.
    return jax.tree_util.tree_map(h_shape_get, input)

In [14]:
def has_nan(tensor):
    return torch.any(torch.isnan(tensor))

In [15]:
class ModelEvalMode:
    def __init__(self, model):
        self.model = model

    def __enter__(self):
        self.model.eval()

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.model.train()

# Data

In [16]:
d = D.load_dataset("text",
                         data_files=data_paths, sample_by="paragraph")
d

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-751f099585d07f63/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-751f099585d07f63/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 18423
    })
})

In [17]:
d = d['train']
d

Dataset({
    features: ['text'],
    num_rows: 18423
})

In [18]:
d[1000]

{'text': 'Professor Snape threw him backwards, and Harry stumbled, but just managed to keep standing.'}

In [19]:
def str_to_np(s, dtype=np.int8):
    s = s.encode('ascii', errors='ignore')
    return np.frombuffer(s, dtype=dtype)

str_to_np('hello')

array([104, 101, 108, 108, 111], dtype=int8)

In [20]:
def str_to_onehot(s):
    return np.eye(256)[str_to_np(s)]

In [21]:
dc = d.map(lambda batch: {'input': [str_to_np(t).astype(np.int32) + 1 for t in batch['text']]}, batched=True) #: added one to the char indices to make zero available for the pad token
dc

Dataset({
    features: ['text', 'input'],
    num_rows: 18423
})

In [22]:
dc = dc.filter(lambda x: (len(x['input']) > 30 and len(x['text'].split()) > 4), batched=False)
dc

Dataset({
    features: ['text', 'input'],
    num_rows: 16371
})

In [23]:
dc.set_format("torch", columns=["input",])

In [24]:
torch_shape_get(dc[1000:1010]['input'])

[(torch.int64, torch.Size([1126])),
 (torch.int64, torch.Size([727])),
 (torch.int64, torch.Size([163])),
 (torch.int64, torch.Size([232])),
 (torch.int64, torch.Size([106])),
 (torch.int64, torch.Size([88])),
 (torch.int64, torch.Size([82])),
 (torch.int64, torch.Size([69])),
 (torch.int64, torch.Size([127])),
 (torch.int64, torch.Size([64]))]

In [25]:
dc = dc.shuffle()

In [26]:
dcs = dc.train_test_split(test_size=0.2)
dcs

DatasetDict({
    train: Dataset({
        features: ['text', 'input'],
        num_rows: 13096
    })
    test: Dataset({
        features: ['text', 'input'],
        num_rows: 3275
    })
})

# Model

- [GRU --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html)

- [torch.nn.utils.rnn.pack_sequence --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_sequence.html#torch.nn.utils.rnn.pack_sequence) (not necessarily needed)

- [Embedding --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)

- [torch.nn.utils.rnn.pad_sequence --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html)


In [27]:
import torch
import torch.nn as nn
import statistics
from torch.nn.utils.rnn import pad_sequence

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # 257 embeddings: index 0 is padding, indices 1..256 are ASCII codes shifted by one.
        self.embedding = nn.Embedding(257, 64, padding_idx=0)
        # 2-layer GRU with hidden size 128 and dropout 0.1 between the layers.
        self.gru = nn.GRU(64, 128, num_layers=2, dropout=0.1, batch_first=True)
        # Outputs unnormalized logits. loss_fn shifts the targets by -1, so logit i scores ASCII code i;
        # the last of the 257 logits is never a target (the task statement asks for 256 outputs).
        self.fc = nn.Linear(128, 257)

    def forward(self, x, hidden=None):
        # x is a list of 1-D index tensors; pad them into one (batch, time) tensor.
        x = pad_sequence(x, batch_first=True)
        emb = self.embedding(x)  # (batch, time, 64)
        out, hidden = self.gru(emb, hidden)  # (batch, time, 128)
        logits = self.fc(out)  # (batch, time, 257)
        return logits, hidden


In [42]:
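def loss_fn(y, y_hat):
    # y: padded target indices (0 = padding); y_hat: logits of shape (batch, time, classes).
    y = y.reshape(-1) - 1  # token index -> ASCII code; padding becomes -1
    y_hat = y_hat.reshape(-1, y_hat.size(-1))
    mask = y != -1  # ignore padded positions
    y = y[mask]
    y_hat = y_hat[mask]
    total_loss = F.cross_entropy(y_hat, y, reduction="sum")

    # Mean over real (non-padding) characters; the max() guards against an all-padding chunk.
    return total_loss / max(y.shape[0], 1)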
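
The masking idea in `loss_fn` can be verified by hand on a tiny case. The following sketch recomputes a padding-masked cross-entropy in plain Python; it assumes a toy problem with three classes instead of 256, and `masked_cross_entropy` is a hypothetical helper written only for this check.

```python
import math

PAD = 0  # padded target positions carry index 0

def masked_cross_entropy(logits, targets):
    """Mean negative log-likelihood over the non-padding targets.

    logits:  one list of scores per position; score j belongs to class j.
    targets: token indices (class + 1); PAD positions are skipped.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == PAD:
            continue
        log_z = math.log(sum(math.exp(s) for s in scores))  # log of the softmax normalizer
        total += log_z - scores[target - 1]  # -log softmax(scores)[target - 1]
        count += 1
    return total / max(count, 1)

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
targets = [1, 3, PAD]  # the third position is padding and must not contribute
loss = masked_cross_entropy(logits, targets)
print(round(loss, 4))
```

Dividing by the number of real characters rather than by the padded length keeps the loss comparable across batches with different amounts of padding.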


In [29]:
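def shift_left(tensor_list, pad_value=0):
    # Build next-character targets: shift every sequence left by one and pad the last position.
    shifted_tensors = []
    for tensor in tensor_list:
        s = torch.roll(tensor, -1)  # roll returns a new tensor, so the input is not modified
        s[-1] = pad_value
        shifted_tensors.append(s)

    return shifted_tensors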

# Example usage:
input_ids = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8])]
print("Input Ids:")
print(input_ids)

target_ids = shift_left(input_ids)
print("Shifted Left (Target Ids):")
print(target_ids)

Input Ids:
[tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8])]
Shifted Left (Target Ids):
[tensor([2, 3, 4, 0]), tensor([6, 7, 8, 0])]


# Beam Search Generation

In [30]:
import torch
import heapq

def tensor_to_string(tensor):
    chars = [chr(c) for c in tensor]
    return ''.join(chars)

def tensor_append_scalar(tensor, scalar):
    scalar_tensor = torch.tensor(scalar).view(1)  # Add a dimension to match the original tensor's dimensions
    scalar_tensor = scalar_tensor.to(device)

    # Append the scalar to the original tensor
    result = torch.cat((tensor, scalar_tensor), dim=0)
    return result


def generate_next_top_k(model, input_sequence, k):
    logits, _ = model([input_sequence])
    logits = logits[0, -1, :]  # logits for the character that follows the last input position

    # log_softmax is numerically safer than softmax followed by log.
    log_probs = torch.log_softmax(logits, dim=-1)
    top_k_values, top_k_indices = torch.topk(log_probs, k)

    #: Logit index i is ASCII code i, so add one to get back a token index.
    return [(tensor_append_scalar(input_sequence, idx.item() + 1), log_prob.item()) for idx, log_prob in zip(top_k_indices, top_k_values)]

def beam_search(model, desired_length, starting_string, k=5):
    with ModelEvalMode(model), torch.no_grad():
        input_sequence = torch.tensor(str_to_np(starting_string).astype(np.int32) + 1, dtype=torch.long)
        input_sequence = input_sequence.to(device)

        log_prob = 0.0

        beam = [(input_sequence, log_prob)]

        while len(beam[0][0]) < desired_length:
            new_beam = []
            for seq, log_prob in beam:
                next_top_k = generate_next_top_k(model, seq, k)
                new_beam.extend([(new_seq, new_log_prob + log_prob) for new_seq, new_log_prob in next_top_k])

            beam = heapq.nlargest(k, new_beam, key=lambda x: x[1])

        return [tensor_to_string(seq - 1) for seq, _ in beam]

In [31]:
a = str_to_np("Harry ").astype(np.int32) + 1
ic(a)
tensor_to_string(a -1)

ic| a: array([ 73,  98, 115, 115, 122,  33], dtype=int32)


'Harry '

In [32]:
def eval_gen(*args, display=999999, **kwargs):
    generated_texts = beam_search(
        *args, **kwargs,
    )

    for idx, text in enumerate(generated_texts):
        if idx >= display:
            break
        
        print(f"Generated text {idx + 1}: {text}")

# Train

In [33]:
dt = dcs['train']
dt

Dataset({
    features: ['text', 'input'],
    num_rows: 13096
})

In [34]:
torch.cuda.empty_cache()

In [35]:
val = dcs['test']
val = val['input']
val = pad_sequence(val, batch_first=True)
val.shape

torch.Size([3275, 1445])

In [36]:

sorted_tensor_list = sorted(dt['input'], key=lambda x: x.size(0))


In [37]:
dt = sorted_tensor_list

In [38]:
dt[0].shape

torch.Size([31])
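
Sorting the training sequences by length before slicing them into batches reduces the amount of padding per batch. The sketch below counts padded cells with and without sorting. The lengths are hypothetical values chosen for illustration, and `padded_cells` is not a notebook function.

```python
def padded_cells(lengths, batch_size):
    """Total cells (real characters + padding) when each batch is padded to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [31, 1126, 64, 727, 69, 163, 82, 232]  # hypothetical sequence lengths
unsorted_cost = padded_cells(lengths, batch_size=4)
sorted_cost = padded_cells(sorted(lengths), batch_size=4)
print(sum(lengths), unsorted_cost, sorted_cost)
```

The trade-off is that batches built this way always contain the same sequences in the same order, so some of the randomness of shuffled mini-batches is lost.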

In [43]:

if torch.cuda.is_available():
    device = 'cuda'
    non_blocking = True
elif True:
    device = 'cpu'
    non_blocking = False
else:
    #: causes NaNs
    device = 'mps'
    non_blocking = False

i = 0

#: Feel free to edit these hyperparameters or the optimizer
#: You might want to use a learning-rate scheduler, such as
#: [ReduceLROnPlateau --- PyTorch 2.0 documentation](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html)
epochs = 400
batch_size = 4096
learning_rate = 0.01
max_len = 0
acc_step = 10
m = Model().to(device=device, non_blocking=non_blocking)
m.train()

optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)
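scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
#: Note: scheduler.step(metric) is not called in the training loop below, so the learning rate stays constant.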
counter = 0


for epoch in range(epochs):
    for i in range(0, len(dt), batch_size):
        inputs = dt[i:i+batch_size]
        if max_len > 0:
            inputs = [x[:max_len] for x in inputs]
        lens = [len(seq) for seq in inputs]
        current_max_len = max(lens)
        mean_len = statistics.mean(lens)
        total_loss = 0

        hidden = None
        inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
        #: Truncated backpropagation through time: process the batch in chunks of 40 characters,
        #: carrying the hidden state across chunks but cutting the gradient at chunk borders.
        for j in range(0, current_max_len, 40):
            x = list(map(lambda x: x.to(device), inputs[:, j:j+40]))
            y = shift_left(x)
            y = torch.stack(y)
            y_hat, hidden = m(x, hidden=hidden)
            loss = loss_fn(y, y_hat)

            loss.backward()  #: gradients accumulate over all chunks of this batch
            total_loss += loss.item()
            hidden = hidden.detach()
        optimizer.step()
        optimizer.zero_grad()

        #: Sum of the per-chunk mean losses divided by the mean sequence length: a rough
        #: progress indicator, not an exact per-character cross-entropy.
        l = total_loss / mean_len

        print(f"loss: {l:>7f}  [{counter:>5d}, epoch={epoch}]")

        counter += 1
    if epoch % 15 == 0:
        eval_gen(display=3, model=m, desired_length=100, starting_string="Harry ", k=32)

loss: 0.266379  [    0, epoch=0]
loss: 0.195401  [    1, epoch=0]
loss: 0.158457  [    2, epoch=0]
loss: 0.188191  [    3, epoch=0]
Generated text 1: Harry oerererererererererererererererererererererererererererererererererererererererererererererere
Generated text 2: Harry ererererererererererererererererererererererererererererererererererererererererererererererer
Generated text 3: Harry oerererererererererererererererererererererererererererererereerererererererererererererererer
loss: 0.166151  [    4, epoch=1]
loss: 0.133224  [    5, epoch=1]
loss: 0.149799  [    6, epoch=1]
loss: 0.194010  [    7, epoch=1]
loss: 0.159208  [    8, epoch=2]
loss: 0.120008  [    9, epoch=2]
loss: 0.135306  [   10, epoch=2]
loss: 0.184239  [   11, epoch=2]
loss: 0.156213  [   12, epoch=3]
loss: 0.117514  [   13, epoch=3]
loss: 0.129122  [   14, epoch=3]
loss: 0.170907  [   15, epoch=3]
loss: 0.149443  [   16, epoch=4]
loss: 0.113030  [   17, epoch=4]
loss: 0.124537  [   18, epoch=4]
loss: 0.167510  
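
A `ReduceLROnPlateau` scheduler only changes the learning rate when its `step` method is called with the monitored metric, typically once per epoch. The sketch below shows that call pattern in isolation. The dummy parameter, the `factor` and `patience` values, and the list of epoch losses are assumptions made for illustration, not values from the training run above.

```python
import torch

# A single dummy parameter stands in for the model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Hypothetical per-epoch losses that stop improving after the second epoch.
epoch_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]
lrs = []
for epoch_loss in epoch_losses:
    # In a real loop this call would follow the training (or validation) pass of the epoch.
    scheduler.step(epoch_loss)
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

Once the loss has failed to improve for more than `patience` consecutive epochs, the learning rate is multiplied by `factor`. A held-out validation loss is usually a better metric to monitor than the training loss.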

In [44]:
eval_gen(display=3, model=m, desired_length=100, starting_string="Harry ", k=32)

Generated text 1: Harry little gang around them clearly enough had taken cover behind desks, she had taken cover behin
Generated text 2: Harry little gang around them clearly enough had taken cover behind desks into very angry wolfhounds
Generated text 3: Harry little gang around them clearly enough so that they might react of the desks into very angry t


In [45]:
eval_gen(display=50, model=m, desired_length=250, starting_string="Harry ", k=100)

Generated text 1: Harry students had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered that had quickly cornered t
Generated text 2: Harry students had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly enough so that
Generated text 3: Harry students had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly cornered them 
Generated text 4: Harry students had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into 

In [46]:
eval_gen(display=50, model=m, desired_length=250, starting_string="Arcturus ", k=100)

Generated text 1: Arcturus with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered th
Generated text 2: Arcturus with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered he
Generated text 3: Arcturus with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered to
Generated text 4: Arcturus with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly with a wave of her wand, she had transfi

In [47]:
eval_gen(display=50, model=m, desired_length=150, starting_string="Draco ", k=100)

Generated text 1: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly enough so that the
Generated text 2: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly cornered the rest 
Generated text 3: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered that had quickly cornered them 
Generated text 4: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered that had quickly cornered them.
Generated text 5: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them had taken cover behind des
Generated text 6: Draco with a wave of her wand, she had transfigured all the desks into very angry wolfhounds that had quickly cornered them clearly desks

In [48]:
eval_gen(display=50, model=m, desired_length=150, starting_string="Harry looked at ", k=100)

Generated text 1: Harry looked at her wand, she had taken cover behind desks, she had taken cover behind desks into very angry wolfhounds that had quickly cornered them
Generated text 2: Harry looked at her wand, she had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly cornered the rest of them
Generated text 3: Harry looked at her wand, she had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly enough so that they might
Generated text 4: Harry looked at her wand, she had taken cover behind desks, she had taken cover behind desks into very angry wolfhounds that had quickly cornered the 
Generated text 5: Harry looked at her wand, she had taken cover behind desks into very angry wolfhounds that had quickly cornered them clearly wolfhounds that had quick
Generated text 6: Harry looked at her wand, she had taken cover behind desks into very angry wolfhounds that had quickly cornered that had quickly cornered