## Homework 6
### Language Modeling

Welcome to Homework 6! 

The homework contains several tasks. The number of points awarded for a correct solution is given in each task header. The maximum number of points for each homework is _four_.

The **grading** for each task is the following:
- correct answer - **full points**
- insufficient solution or solution resulting in the incorrect output - **half points**
- no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the time, it is possible to answer the question in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` - not good, `word = 'cat'` - good). Avoid lines of code longer than 100 characters (79 characters is ideal). If needed, add comments to your code; however, good code should be easily readable without them :)

Finally, all your answers should be written only by yourself. Copying them from other sources will be considered academic fraud. You can discuss the tasks with your classmates, but each solution must be individual.

<font color='red'>**Important!**</font> **Before sending your solution, run `Kernel -> Restart & Run All` to ensure that all your code works.**

In [1]:
!nvcc --version

/bin/bash: nvcc: command not found


In [2]:
!pip3 install --quiet torchtext==0.11.0 datasets torchinfo

ERROR: Could not find a version that satisfies the requirement torchtext==0.11.0 (from versions: 0.1.1, 0.2.0, 0.2.1, 0.2.3, 0.3.1, 0.4.0, 0.5.0, 0.6.0, 0.12.0)
ERROR: No matching distribution found for torchtext==0.11.0
You should consider upgrading via the '/gpfs/space/software/jupyterhub/python/jupyter/bin/python -m pip install --upgrade pip' command.


In [4]:
from datasets import load_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchinfo import summary

Load the [Penn Treebank dataset](https://huggingface.co/datasets/ptb_text_only). Structurally, it is the same as the Wikitext-2 dataset used in Lab 6. Please refer to the Lab materials for more details on data structure and loading.

In [5]:
train_dataset = load_dataset("ptb_text_only", split="train")

Downloading builder script: 6.50kB [00:00, 1.94MB/s]                   
Downloading metadata: 2.15kB [00:00, 639kB/s]                    


Downloading and preparing dataset ptb_text_only/penn_treebank (download: 5.68 MiB, generated: 5.72 MiB, post-processed: Unknown size, total: 11.40 MiB) to /gpfs/space/home/chenghan/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f...



Dataset ptb_text_only downloaded and prepared to /gpfs/space/home/chenghan/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f. Subsequent calls will reuse this data.


In [6]:
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_dataset['sentence']), specials=['<pad>', '<unk>', '<bos>', '<eos>'])
vocab.set_default_index(vocab['<unk>'])

In [7]:
import torch
from torch import nn, Tensor
from torch.utils.data import dataset

def data_process(raw_text_iter: dataset.IterableDataset, device: torch.device) -> list:
    """Converts raw text into a list of 1-D token-id tensors, one per sentence, each starting with <bos>."""
    data = [torch.tensor(vocab(['<bos>']) + vocab(tokenizer(item["sentence"])), dtype=torch.long, device=device) for item in raw_text_iter]
    # Drop sequences with two or fewer token ids (counting <bos>)
    return list(filter(lambda t: t.numel() > 2, data))

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

dataset = load_dataset("ptb_text_only")
train_data = data_process(dataset["train"], device)
val_data = data_process(dataset["validation"], device)
test_data = data_process(dataset["test"], device)

Reusing dataset ptb_text_only (/gpfs/space/home/chenghan/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f)


In [9]:
from torch.nn.utils.rnn import pad_sequence
def _collate_fn(batch):
    inp = pad_sequence(batch, batch_first=True)
    target = pad_sequence([torch.cat((item[1:], torch.tensor(vocab['<eos>'], device=device).unsqueeze(0))) for item in batch], batch_first=True)
    return inp, target 

In [10]:
batch_size = 16

train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, collate_fn=_collate_fn, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, collate_fn=_collate_fn, shuffle=False)
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, collate_fn=_collate_fn, shuffle=False)

## Task 1. Highway Model (3.5 points)

In this task, you will have to modify the model from Lab 6.

You will have to add:
- Three more convolutional layers
- Highway layer

To add extra convolutional layers, you can just copy `conv_block_1` and call it `conv_block_2`, for example.

Then, you will need to add a `gate_layer` which is a simple linear layer that outputs the same dimension as it takes. 

In the forward pass, add a `transform_gate`, which is a nonlinear transformation of the embedded inputs, i.e. the gate layer followed by a sigmoid. Then add a `carry_gate`, which is simply `1 - transform_gate`. Finally, the output of the highway layer is the element-wise product of the input and the carry gate plus the element-wise product of the previous layer's output and the transform gate. You can find more information about the highway layer [here](https://paperswithcode.com/method/highway-layer).

Finally, carry the output of the highway layer to the next convolutional block.

Hint: To perform an element-wise multiplication of tensors `a` and `b`, you can do `a * b` or `torch.mul(a, b)`.
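The highway description above can be sketched as a small standalone module. This is only an illustration under stated assumptions: the class name `HighwaySketch` and the toy tensor sizes are hypothetical, the gate is computed from the layer input as the text describes, and `layer_out` stands for whatever the preceding block produced (here it is just random data).

```python
import torch
from torch import nn


class HighwaySketch(nn.Module):
    """Illustrative highway connection (hypothetical name, not part of the lab code)."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate layer: a linear map that keeps the feature dimension unchanged
        self.gate_layer = nn.Linear(dim, dim)

    def forward(self, layer_in: torch.Tensor, layer_out: torch.Tensor) -> torch.Tensor:
        # Transform gate: gate layer followed by a sigmoid, values in (0, 1)
        transform_gate = torch.sigmoid(self.gate_layer(layer_in))
        # Carry gate: how much of the untouched input is passed through
        carry_gate = 1 - transform_gate
        # Element-wise mix of the transformed output and the original input
        return layer_out * transform_gate + layer_in * carry_gate


# Toy shapes (assumed): batch of 2 sequences, 5 tokens, 8 features
torch.manual_seed(0)
highway = HighwaySketch(dim=8)
x_in = torch.randn(2, 5, 8)
x_out = torch.randn(2, 5, 8)
mixed = highway(x_in, x_out)
print(mixed.shape)
```

Because both terms are added element-wise, the input and the previous layer's output must have the same shape, which is why the embedding and hidden dimensions need to match in this setup.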

In [11]:
class CNNLM(nn.Module):
    def __init__(self, num_words, emb_dim, hid_dim, kernel_size, tie_weights=False):
        super().__init__()
        self.emb = nn.Embedding(num_words, emb_dim, padding_idx=0)
        pad_size = kernel_size - 1
        self.conv_block_1 = nn.Sequential(nn.ConstantPad1d((pad_size, 0), 0),
                                  nn.Conv1d(emb_dim, hid_dim, kernel_size), 
                                  nn.ConstantPad1d((pad_size, 0), 0),
                                  nn.Conv1d(hid_dim, hid_dim, kernel_size),
                                  nn.ConstantPad1d((pad_size, 0), 0), 
                                  nn.Conv1d(hid_dim, hid_dim, kernel_size))
        
        # TODO: Define another three convolutional layers
        self.conv_block_2 = nn.Sequential(nn.ConstantPad1d((pad_size, 0), 0),
                                  nn.Conv1d(emb_dim, hid_dim, kernel_size), 
                                  nn.ConstantPad1d((pad_size, 0), 0),
                                  nn.Conv1d(hid_dim, hid_dim, kernel_size),
                                  nn.ConstantPad1d((pad_size, 0), 0), 
                                  nn.Conv1d(hid_dim, hid_dim, kernel_size))

        # TODO: Define a highway gate layer
        
        self.gate_layer = nn.Linear(hid_dim,hid_dim)
        self.sigmoid = nn.Sigmoid()
        self.lin_out = nn.Linear(hid_dim, num_words)

        if tie_weights:
            assert emb_dim == hid_dim, "To tie the weights, the embedding and hidden dimensions must be the same!"
            self.lin_out.weight = self.emb.weight

    def forward(self, x):
        x_emb = self.emb(x)

        x_conv = x_emb.permute(0, 2, 1)
        x_conv = self.conv_block_1(x_conv)
        x_conv = x_conv.permute(0, 2, 1)

        # TODO: Calculate the transform gate
       
        transform_gate = self.sigmoid(self.gate_layer(x_conv))
        
        # TODO: Calculate the carry gate
        carry_gate = 1-transform_gate
        # TODO: Calculate the output of the highway layer
        highway_out = torch.mul(x_conv,transform_gate)+torch.mul(x_emb,carry_gate)

        highway_out = highway_out.permute(0, 2, 1)
        # TODO: Pass the new inputs to the next convolutional block
        x_conv = self.conv_block_2(highway_out)
        x_conv = x_conv.permute(0, 2, 1)

        out = self.lin_out(x_conv)
        return out

In [12]:
print("Current device is:", device)

Current device is: cuda


In [13]:
num_words = len(vocab)
emb_dim = 300
hid_dim = 300
kernel_size = 3
tie_weights = True

model = CNNLM(num_words, emb_dim, hid_dim, kernel_size, tie_weights=tie_weights)
model = model.to(device)

By running the `summary` function from `torchinfo`, you can test whether your network performs a forward pass without any errors. If you did everything correctly, the output should be similar to this:

```
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
CNNLM                                    --                        --
├─Embedding: 1-1                         [16, 85, 300]             2,977,500
├─Sequential: 1-2                        [16, 300, 85]             --
│    └─ConstantPad1d: 2-1                [16, 300, 87]             --
│    └─Conv1d: 2-2                       [16, 300, 85]             270,300
│    └─ConstantPad1d: 2-3                [16, 300, 87]             --
│    └─Conv1d: 2-4                       [16, 300, 85]             270,300
│    └─ConstantPad1d: 2-5                [16, 300, 87]             --
│    └─Conv1d: 2-6                       [16, 300, 85]             270,300
├─Linear: 1-3                            [16, 85, 300]             90,300
├─Sequential: 1-4                        [16, 300, 85]             --
│    └─ConstantPad1d: 2-7                [16, 300, 87]             --
│    └─Conv1d: 2-8                       [16, 300, 85]             270,300
│    └─ConstantPad1d: 2-9                [16, 300, 87]             --
│    └─Conv1d: 2-10                      [16, 300, 85]             270,300
│    └─ConstantPad1d: 2-11               [16, 300, 87]             --
│    └─Conv1d: 2-12                      [16, 300, 85]             270,300
├─Linear: 1-5                            [16, 85, 9925]            2,987,425
==========================================================================================
Total params: 7,677,025
Trainable params: 7,677,025
Non-trainable params: 0
Total mult-adds (G): 2.30
==========================================================================================
Input size (MB): 0.01
Forward/backward pass size (MB): 134.10
Params size (MB): 30.71
Estimated Total Size (MB): 164.82
==========================================================================================
```

In [None]:
max_seq_len = max(batch[0].size(1) for batch in train_dataloader)

summary(model, input_size=(batch_size, max_seq_len), dtypes=[torch.long])

In [15]:
print(f"Number of trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Number of trainable parameters: 4,699,525
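The count printed above can be cross-checked by hand. The sketch below recomputes it from the hyperparameters and the vocabulary size of 9,925 shown in the expected summary; the helper names are illustrative. It also suggests one plausible reason why the expected summary lists 7,677,025 parameters: there the embedding matrix and the output `Linear` weight are both counted, whereas with tied weights `model.parameters()` yields the shared matrix only once.

```python
num_words, emb_dim, hid_dim, kernel_size = 9925, 300, 300, 3


def conv1d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    # One weight per (out channel, in channel, kernel position) plus one bias per out channel
    return out_ch * in_ch * kernel + out_ch


def linear_params(in_dim: int, out_dim: int) -> int:
    # Weight matrix plus bias vector
    return in_dim * out_dim + out_dim


embedding = num_words * emb_dim
convs = 6 * conv1d_params(hid_dim, hid_dim, kernel_size)  # emb_dim == hid_dim here
gate = linear_params(hid_dim, hid_dim)
output_bias = num_words  # the output weight matrix is tied to the embedding

tied_total = embedding + convs + gate + output_bias
untied_total = tied_total + num_words * hid_dim  # count the shared matrix a second time

print(f"{tied_total:,}")    # 4,699,525
print(f"{untied_total:,}")  # 7,677,025
```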


In [16]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.Adam(model.parameters())

Train your model for some epochs. You will see that at some point training loss and perplexity will keep decreasing while the validation loss and perplexity will start increasing. This means that your model starts overfitting and you can stop the training. 
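The stopping rule described above could be automated along these lines. This is a minimal sketch, not part of the lab code: `should_stop`, `patience`, and the loss values are hypothetical and only illustrate the idea of stopping once the validation loss has not improved for a few epochs.

```python
def should_stop(val_losses: list, patience: int = 3) -> bool:
    """Return True if the best validation loss is at least `patience` epochs old."""
    if not val_losses:
        return False
    best_epoch = val_losses.index(min(val_losses))
    epochs_since_best = len(val_losses) - 1 - best_epoch
    return epochs_since_best >= patience


# Made-up validation losses: improving for three epochs, then rising
history = [5.78, 5.40, 5.21, 5.25, 5.30, 5.36]
best_epoch = history.index(min(history))
print(best_epoch, should_stop(history))
```

Since the training loop saves a checkpoint after every epoch, the epoch index with the lowest validation loss also identifies which saved file to load afterwards.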

In [17]:
n_epochs = 100
print_each = len(train_dataloader) // 5
total_steps = 0

for i in range(n_epochs):
    # Set the model to the training mode
    model.train()
    # Iterate through each batch in the training dataloader
    for step, (inputs, target) in enumerate(train_dataloader):
        # Reset the gradients so they do not accumulate across steps
        optimizer.zero_grad()
        # Predict the output
        pred = model(inputs)
        # Calculate the loss
        loss = criterion(pred.view(-1, pred.size(2)), target.flatten())
        # Backward pass on the loss
        loss.backward()
        # Update the model's weights
        optimizer.step()

        # Print out the training progress
        if step % print_each == 0 and step > 0:
            print(f"Step [{total_steps}/{len(train_dataloader) * n_epochs}] | Train Loss: {loss.item()} | Train Perplexity: {torch.exp(loss).item()}")
        total_steps += 1

    # Set the model to the evaluation mode
    model.eval()
    total_val_loss = 0
    # Turn off the gradient recording 
    with torch.no_grad():
        for inputs, target in val_dataloader:
            pred = model(inputs)
            loss = criterion(pred.view(-1, pred.size(2)), target.flatten())
            total_val_loss += loss

    val_loss = total_val_loss / len(val_dataloader)
    print(f"Epoch {i} | Val Loss: {val_loss.item()} | Val Perplexity: {torch.exp(val_loss).item()}")
    
    print(f"Saving the model to cnn_lm_highway_{i}.pt...")
    torch.save(model, f"cnn_lm_highway_{i}.pt")

Step [524/262100] | Train Loss: 6.280882835388184 | Train Perplexity: 534.2601318359375
Step [1048/262100] | Train Loss: 5.905404567718506 | Train Perplexity: 367.0157165527344
Step [1572/262100] | Train Loss: 6.005997657775879 | Train Perplexity: 405.855712890625
Step [2096/262100] | Train Loss: 5.821268081665039 | Train Perplexity: 337.399658203125
Step [2620/262100] | Train Loss: 5.792871952056885 | Train Perplexity: 327.95355224609375
Epoch 0 | Val Loss: 5.782670974731445 | Val Perplexity: 324.6251220703125
Saving the model to cnn_lm_highway_0.pt...
Step [3145/262100] | Train Loss: 5.585460662841797 | Train Perplexity: 266.5230407714844
Step [3669/262100] | Train Loss: 5.27031135559082 | Train Perplexity: 194.47650146484375
Step [4193/262100] | Train Loss: 5.439305305480957 | Train Perplexity: 230.28213500976562
Step [4717/262100] | Train Loss: 5.489867210388184 | Train Perplexity: 242.22503662109375
Step [5241/262100] | Train Loss: 5.562743663787842 | Train Perplexity: 260.5366821

Load the model with the lowest validation perplexity and evaluate it on the test set.



In [19]:
# Epoch 17 has lowest result : Epoch 17 | Val Loss: 5.090243339538574 | Val Perplexity: 162.4293975830078
model = torch.load('cnn_lm_highway_17.pt')

In [20]:
model.eval()
total_test_loss = 0
# Turn off the gradient recording 
with torch.no_grad():
    for inputs, target in test_dataloader:
        pred = model(inputs)
        loss = criterion(pred.view(-1, pred.size(2)), target.flatten())
        total_test_loss += loss

test_loss = total_test_loss / len(test_dataloader)
print(f"Test Loss: {test_loss.item()} | Test Perplexity: {torch.exp(test_loss).item()}")

Test Loss: 5.009646415710449 | Test Perplexity: 149.8517303466797
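As a sanity check, perplexity is the exponential of the mean cross-entropy, so the two reported numbers should agree. The sketch below verifies this for the test loss printed above. Using made-up batch statistics, it also illustrates that averaging per-batch mean losses (as the evaluation loop does) is not exactly the same as a token-weighted average when batches contain different numbers of tokens.

```python
import math

# Reported test loss from the output above
test_loss = 5.009646415710449
test_perplexity = math.exp(test_loss)
print(round(test_perplexity, 2))

# Hypothetical (mean loss, number of non-padding tokens) pairs for three batches
batches = [(5.0, 400), (5.2, 100), (4.8, 500)]

# Unweighted mean of batch means
batch_mean = sum(loss for loss, _ in batches) / len(batches)
# Token-weighted mean: each token contributes equally
token_mean = sum(loss * n for loss, n in batches) / sum(n for _, n in batches)

print(batch_mean, token_mean)
```

With batches of similar size the two averages are close, so the reported perplexity remains a reasonable estimate.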


## Task 2. Generating Text (0.5 points)

Try different sentence seeds and temperature values for generating new sentences. Report some of the generated sentences. Report on how the temperature affects the generated sentences.

In [38]:
max_len = 100
temperature = 20

test_sent = 'will smith hits'.split()
test_sent = vocab(test_sent)
test_sent = [vocab['<bos>']] + test_sent
test_sent = torch.tensor(test_sent).unsqueeze(0).to(device)

def gen_sent_by_temperature(test_sent, temperature):
    model.eval()
    with torch.no_grad():
        while True:
            pred = model(test_sent)
            pred[:, :, 1] = pred[:, :, 1] * 1e-6
            next_token = torch.multinomial(torch.softmax(pred / temperature, dim=2)[:,-1], 1)
            test_sent = torch.cat((test_sent, next_token), dim=1)
            if next_token.item() == vocab['<eos>']:
                break
            if test_sent.size(1) == max_len:
                break

    print(f'In  temperature {temperature}:',' '.join(vocab.lookup_tokens(test_sent.squeeze().tolist())))
    print('*******'*10)

temp_lst=[*range(1, 41, 5)]
for tmp in temp_lst:
    gen_sent_by_temperature(test_sent, tmp)

In  temperature 1: <bos> will smith hits . n n about n n he said <eos>
**********************************************************************
In  temperature 6: <bos> will smith hits lion remic kuwait kemper continued avon prime fiduciary opponents identify investor achievement alcohol solutions players need troubling tro creek tabloid fairness joint misconduct unified pro-life jittery fighter 24-hour headaches centennial fear frank slashing concern listings bills critic phones huge tall store aided home craft boosted up count force jet banking scenario acquires americans reforms cyclical banxquote swelled eurobonds laying appliances lifting federated mesa remainder reasons road danny actor rout tobacco wealthy trader maximize health asbestos returned disk cans kemper program highways barrier write-downs corp mail-order killing racked eric critical to gridlock significantly architecture novelist he formal
**********************************************************************
In  temper

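**(A):** The temperature $\theta$ is an additional parameter that divides the logits before the softmax and thereby changes the shape of the output distribution. A higher temperature $\theta$ flattens the distribution and “excites” previously low-probability outputs. A lower temperature $\theta$ sharpens it, lowering the smaller outputs relative to the largest outputs [1]. In the samples above, temperature 1 yields a short sentence that ends with `<eos>`, while temperature 6 already produces a long sequence of mostly unrelated words.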
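To make the effect concrete, the sketch below applies a softmax with different temperatures to a small set of made-up logits. The function name `softmax_with_temperature` and the logit values are illustrative assumptions, not taken from the model.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four tokens
for temperature in (0.5, 1.0, 6.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature grows, the probabilities move toward uniform, which is consistent with the nearly random word sequences sampled at the higher temperature.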

## Reference

[1] [What is Temperature in NLP?](https://lukesalamone.github.io/posts/what-is-temperature/)