# Homework 4
This homework will guide you step by step to train a CharRNN. 
This homework will show you how to use pytorch for building the pipline of deep learning task, including  

* Loading data
* Creating and training a DNN model (more specifically, build and train a LSTM model) 
* Using a trained model


In [1]:
!pip install torch torchvision pandas tensorboard



You are using pip version 19.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [2]:
import torch 
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter

import pandas as pd
import numpy as np
import os
import shutil
import time
import random
import copy

In [3]:
%load_ext tensorboard

In [4]:
# check GPU status
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print("the {}th GPU is: {}".format(i, torch.cuda.get_device_name(0)))
else:
    print("Can not find any GPU!!! Please check your colab's setting.")

the 0th GPU is: GeForce GTX 970M


## Building deep learning pipeline

## Basic Introduction
This section will introduce all important basic pieces that pytorch offers. You will need this for whatever DNN task you want to deal with.
1. Load data
2. Build DNN architecture
3. Train and evaluate a DNN
4. Visualizatoin


### Loading Dataset (10 points in total)
This is not a function that one has to use pytorch for this purpose. But, the framework of data loading offered by pytorch can really make one's life easier. Let's start.

The class, ```torch.utils.data.Dataset```, is the key constructor for createing dataloasers. Any dataloader class must inherit the ```Dataset``` and implement three functions: ```__init__()```, ```__len__()```, ```__getitem__()```.

* ```__init__(self, data, **argv)``` takes the input dataset for the  initialization. It will store the data so that other functions in the class can use it. One can also pass other necessory parameters though this initial function. 
* ```__len__(self)``` returns the size of the input dataset.
* ```__getitem__(self, i)``` takes an integer $i$ as input and returns the $i$th sample of the dataset. 

---

For more detail description of the ```Dataset```, please visit https://pytorch.org/docs/stable/data.html.

---


<font color=red>
In the next cell, you will create your first dataloader with following assumptions: 

1. the dataset is stored as a list of tuples. 
2. In each tuple, the first element is an input data and the second element is the label of the data. 
3. Each element can be integer, float number or numpy array. The baseline is all the input data should be the same type and all the output labels should also be consistant on the data types. 
4. The output of ```__getitem__()``` is a dictionary with two keys: input, output

</font>


In [5]:
class yourDataset(Dataset):
    def __init__(self, data):
        self.data = data # store the data 

    def __len__(self):
        length = None

        return len(self.data)

    def __getitem__(self, i):
        assert isinstance(i, int)
        sample = {'input' : None, 
                  'output': None}
        
        return {'input': self.data[i][0], 'output': self.data[i][1],}

In [6]:
##############################################################
#
#  If your implementation is correct, your function should pass all the following tests
#
##############################################################
random.seed(0)

test1_dataset = [(random.random(), random.randint(0,9)) for i in range(5)]
test1 = yourDataset(test1_dataset) 
assert len(test1_dataset) == test1.__len__(), BaseException("The len of the dataset is not correct")
for index, value in enumerate(test1_dataset):
    input, output = test1.__getitem__(index).values()
    assert input == value[0], BaseException("The output sample does NOT match with the {}th element of the input data, {},{}".format(index, input, value[0]))
    assert output == value[1], BaseException("The output sample does NOT match with the {}th element of the output data, {},{}".format(index, output, value[1]))


base_list = [random.random() for i in range(100)]
test2_dataset = [(random.sample(base_list, i), random.randint(0,9)) for i in range(10, 1, -1)]
test2 = yourDataset(test2_dataset) 
assert len(test2_dataset) == test2.__len__(), BaseException("The len of the dataset is not correct")
for index, value in enumerate(test2_dataset):
    input, output = test2.__getitem__(index).values()
    assert input == value[0], BaseException("The output sample does NOT match with the {}th element of the input data, {},{}".format(index, input, value[0]))
    assert output == value[1], BaseException("The output sample does NOT match with the {}th element of the output data, {},{}".format(index, output, value[1]))


### Make loading data easier (15 points in total)
Now you already successfully built your first dataloader with pytorch. But there is two inconvenient things: 
1. It doesn't   generate a randome. 
2. It can only return one sample at a time. What if we want a batch of samples?

You may already find it is a little bit strange that ```__getitem__()``` is a private function that  should **NOT** be directly called by us. 

The truth is that, this function is designed for ```torch.utils.data.DataLoader```. Please read the introduction of the ```DataLoader``` in here: https://pytorch.org/docs/stable/data.html. <font color=red> And answer following questions in following cells:</font>

Read the example code, and answer some questions which will require a little bit knowledge of python itself:
```
dataset = yourDataset(test1_dataset) 
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1,
                        shuffle=True)
for index, batch in enumerate(dataloader):
    sample = batch
    break
```
1. Check the document that I mentioned earlier, explain the three input parameters for ```torch.utils.data.DataLoader``` in the second line. And what is the output of the ```DataLoader```(hint: the data type of the output)? (5 points)



The input parameters to the DataLoader object are as follows:
- dataset: A torch.utils.data.DataSet object that the DataLoader can query.
- batch_size: The number of data points the DataLoader should load when constructing a query.
- shuffle: Should the data loader shuffle the data before beginning the query.

The question to about the output of the DataLoader is unclear. If you mean what is the ouptut of the call to the DataLoader constructor then the answer is a torch
DataLoader object. If you are asking what is the result of querying the DataLoader, it is a generator that yeilds a batch of data where the specified dictionaries 
'input' and 'output' in this situation, contain a subset of the data of size 'batch_size'.

    
    
2. When loading any data, always be careful about the meaning of each dimention of the data!!!! All deep learning packages, not only Pytorch, assume you are responsible for making sure the dimentions of your data match the DNN layers' required the input dimentions. Run the code in the next cell. Which dimension of output batch is the batch size? Why?(5 points)

The batch size is dimension 0. It is this way because this means that each index in a batch corrisponds to a data point. 

In [7]:
random.seed(0)

base_list = [np.random.random() for i in range(100)]
test4_dataset = [(np.random.randint(0.0, 4.0, 5), np.random.randint(0.0, 10.0, 1)) for i in range(10, 1, -1)]
dataset = yourDataset(test4_dataset) 
print(dataset.__getitem__(0)["input"].shape)
print(dataset.__getitem__(0)["output"].shape)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True)

for index, batch in enumerate(dataloader):
    sample = batch
    break
    

input = sample["input"]
output = sample["output"]
print("batch (input shape):", input.shape)
print("batch (output shape):", output.shape)

(5,)
(1,)
batch (input shape): torch.Size([2, 5])
batch (output shape): torch.Size([2, 1])


#### Dealing with samples with different length
When dealing with sequential data, like sentineces and audio recordings, the length of sample is not fix. The problem is that the size of dimention of tensors in one batch much match. The solution is: one have to padding these tensors to make avoid this problem. In ```DataLoader```, parameter ```collate_fn``` take a function which will transform the selected samples into a tensor. By default, it will directly concatenate multiple tensor together. Our new funtion will first pad every tensor to the same shape, then concatenate them together. 


In the next 3 cells, you will see the consequence of unpadding and padding. 

<font color=red>
Read the 2nd cell carefully, explain why I sorted the samples before padding. (5 points)
</font>

I am not sure why you sorted the examples. I have removed your sorting in the cell below yours and the code seems to run fine. the padding capabilities of torch should be able to addriess padding issures without the data having to be in order of length. It appears that it does. I see that the first blcok fails because you  are not matching sizes.

Please see my code and let me know what I am missing here.


In [9]:
random.seed(0)

# make sure you understand these code and check the error message
base_list = [np.random.random() for i in range(100)]
test4_dataset = [(np.random.randint(0, 10, i), np.random.randint(0, 10, 1)) for i in range(10, 1, -1)]
dataset = yourDataset(test4_dataset) 
print(dataset.__getitem__(0)["input"].shape)
print(dataset.__getitem__(1)["input"].shape)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=False)

for index, batch in enumerate(dataloader):
    sample = batch
    break

input = sample["input"]
output = sample["output"]
print("batch (input shape):", input.shape)
print("batch (output shape):", output.shape)

(10,)
(9,)


RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 10 and 9 in dimension 1 at C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensor.cpp:689

In [10]:
def padding(data):
    # get target shape
    spec_lengths = [i["input"].shape[0] for i in data]
    batch_size = len(data)
    sorted_index = np.argsort(spec_lengths)

    sorted_data = {"input": [], "output": []}

    for i in reversed(sorted_index):
        sorted_data["input"].append(torch.from_numpy(data[i]["input"]))
        sorted_data["output"].append(torch.from_numpy(data[i]["output"]))

    # nn.utils.rnn.pad_sequence consider the input tensor as either [B,T,F] or [T, B, F], depends on whether patch first
    return {"input": nn.utils.rnn.pad_sequence(sorted_data["input"], batch_first=True), 
            "output": nn.utils.rnn.pad_sequence(sorted_data["output"], batch_first=True),
            }


def padding(data):
    
    #### SWAP SHORTER/LONGER POSITION #######
    data.reverse()
    print(data)
    
    
    # get target shape
    spec_lengths = [i["input"].shape[0] for i in data]
    batch_size = len(data)
    
    ####### UNSORTED ######################
    sorted_index = range(len(spec_lengths))

    sorted_data = {"input": [], "output": []}

    for i in reversed(sorted_index):
        sorted_data["input"].append(torch.from_numpy(data[i]["input"]))
        sorted_data["output"].append(torch.from_numpy(data[i]["output"]))

    # nn.utils.rnn.pad_sequence consider the input tensor as either [B,T,F] or [T, B, F], depends on whether patch first
    ret =  {"input": nn.utils.rnn.pad_sequence(sorted_data["input"], batch_first=True), 
            "output": nn.utils.rnn.pad_sequence(sorted_data["output"], batch_first=True),
            }
    
    print(ret)
    
    return ret


In [11]:
# Use padding() instead of the default process. 
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=padding)

for index, batch in enumerate(dataloader):
    sample = batch
    break

input = sample["input"]
output = sample["output"]
print("batch (input shape):", input.shape)
print("batch (output shape):", output.shape)

batch (input shape): torch.Size([2, 10])
batch (output shape): torch.Size([2, 1])


### Variable type
In the previous experiment, you might already notice that the ```DataLoader``` converts the data type from numpy array to other type. This is because Pytorch has its own data types and ```DataLoader``` will make sure it's output can be directly fed into the neural network built by Pytorch. 

Following are some examples about pytorch's tensor. 

In [12]:
# convert a numpy array to a torch.Tensor
np_array1 = np.random.randint(0, 4, (3,3))
torch_array1 = torch.from_numpy(np_array1)
print("np_array1: ", np_array1.dtype)
print("torch_array1: ", torch_array1.dtype)

np_array2 = np.random.random_sample((3,3))
torch_array2 = torch.from_numpy(np_array2)
print("np_array2: ", np_array2.dtype)
print("torch_array2: ", torch_array2.dtype)

np_array3 = np.random.random_sample((3,3)).astype(np.float32)
torch_array3 = torch.from_numpy(np_array3)
print("np_array3: ", np_array3.dtype)
print("torch_array3: ", torch_array3.dtype)

np_array1:  int32
torch_array1:  torch.int32
np_array2:  float64
torch_array2:  torch.float64
np_array3:  float32
torch_array3:  torch.float32


In [13]:
# convert a torch.Tensor from int64 to float64
torch_tensor_int = torch.tensor([1,1])
torch_tensor_float = torch_tensor_int.double()
print("torch_tensor_int: ", torch_tensor_int.dtype)
print("torch_tensor_float: ", torch_tensor_float.dtype)

torch_tensor_int:  torch.int64
torch_tensor_float:  torch.float64


In [14]:
# pytorch will automatically convert one type to another, if these are in the same device (GPU or CPU)
# here is an example of float32.cpu and float32.cuda
torch_tensor_int  = torch.ones((3,3), dtype=int).to("cuda")
torch_tensor_float = torch_tensor_int.float() # copy a cpu tensor to gpu

print("torch_tensor_int: ", torch_tensor_int.dtype,  torch_tensor_int.device) 
print("torch_tensor_float: ",torch_tensor_float.dtype,  torch_tensor_float.device) 

torch_tensor_float + torch_tensor_int

torch_tensor_int:  torch.int64 cuda:0
torch_tensor_float:  torch.float32 cuda:0


tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]], device='cuda:0')

In [16]:
# pytorch will not automatically convert one type to another, if these not are in the same device (GPU or CPU)
# here is an example of float32.cpu and float32.cuda
torch_tensor_cpu  = torch.ones((3,3))
torch_tensor_gpu1 = torch_tensor_cpu.to("cuda") # copy a cpu tensor to gpu

print("torch_tensor_cpu: ", torch_tensor_cpu.dtype,  torch_tensor_cpu.device) 
print("torch_tensor_gpu1: ",torch_tensor_gpu1.dtype,  torch_tensor_gpu1.device) 

torch_tensor_cpu + torch_tensor_gpu1

torch_tensor_cpu:  torch.float32 cpu
torch_tensor_gpu1:  torch.float32 cuda:0


RuntimeError: expected device cpu but got device cuda:0

### DNN Model (60 points in total)


In this part, you will deal with a simple deep learning task --- counting. That is, the neural net takes a sequence of $1$s as input, it will learn to generate same number of $2$s as output. The formal definition of the task is: give a input sequence $41^n3$, where 4 is the start symble, 3 is the delimiter and n $1$s are between 4 and 3, the neural network will output a sequence $2^n0$, where 0 is the stop symble.  

After this, we will deal with another more complex task -- copy which will be discussed later this section.

#### Building DNN architecture
The first step of training your own model is defining the architecture.
Here is an example of a small RNN defined in PyTorch.
```
class exampleRNN(torch.nn.Module):
    def __init__(self):
        super(exampleRNN, self).__init__()
        self.fc0 = torch.nn.Embedding(5, 4)
        self.rnn_layer = torch.nn.LSTM(input_size=4, 
                                       hidden_size=10, 
                                       num_layers=1, 
                                       bidirectional=False, 
                                       batch_first=True)

        self.dropout = torch.nn.Dropout(0.4)
        self.fc1 = torch.nn.Linear(10, 5)
        self.softmax = torch.nn.LogSoftmax(dim=-1)

    def forward(self, x, initial_hidden):
        x1 = self.fc0(x)
        x2, hidden = self.rnn_layer(x1, initial_hidden)
        x3 = self.dropout(x2)
        x4 = self.fc1(x3)
        x5 = self.softmax(x4)
        return x5, hidden
```

Above is a RNN architecture, all layers are initialized in ```__init___()```. And, ```forward()``` defines how those layers stack together when process the input data. You can find all those layers' description at [here](https://pytorch.org/docs/stable/index.html). <font color=red>Check the description of embedding layer, lstm layer and logsoftmax layer and answer following questions:
</font>
* For embedding layer, if the shape of x is (1, 8, 5), what should be the shape of x1? And what is the meaning of each dimension? More importantly, what is the data type that this tensor x should be. (3 points)
<font color=blue> 
The output shape should be (1, 8, 4). The dimensions are (# batches, # data points, # features). I dont understand the last question: float32? torch.Tensor? Sequence embedding?
</font>

* For lstm layer:
    1. How many elements are in initial_hidden? (2 points)
    <font color=blue> 
    \# Your answer here
    </font>
    2. If the shape of x1 is (1, 8, 4), what is the shape of each element in initial_hidden and the shape of ```x2```? (2 points)
    <font color=blue> 
    10
    </font>
    3. Under the same assumption, what is the shape of the element in ```hidden```? (2 points)
    <font color=blue> 
    (# batches, # datapoints, 10)
    </font>
    4. under the same assumption, what is the relationship between ```x2[:, -1, :]``` and ```hidden[0]```? (4 points)
    <font color=blue> 
    x2[:, -1, :] == hidden[0]
    </font>

* For log softmax layer, what is the meaning of ```dim=-1```? What is each dimension's meaning of the output? (5 points)
<font color=blue> 
Select the last dim, ie the columns that are your features and not the different datapoints in a batch.
</font>



#### One step before train a DNN model
Before start train a model, we still need to decide the loss function and the optimization function. 
Similar to the example of build dnn architecture, read the code carefully and answer some questions after you read the example code.

```
(1) simple_model = exampleRNN()
(2) loss_fn = torch.nn.NLLLoss()
(3) optimizer = torch.optimizer.Adam(simple_model.parameters(), lr=1e-2)
```
1. Which line initializes the DNN model?(1 points)
<font color=blue> 
(1) simple_model = exampleRNN()
</font>
2. Which line specifies the loss function? (1 points)
<font color=blue> 
(2) loss_fn = torch.nn.NLLLoss()
</font>
3. Which line specifies the optimization function? And what is the meaning of each parameter? (5 points)
<font color=blue> 
(3) optimizer = torch.optimizer.Adam(simple_model.parameters(), lr=1e-2: the first is the parameters to update on the backward pass, the second is what the graident should be multiplied by when updating (learning rate).
</font>
4. If the loss function is ```torch.nn.CrossEntropyLoss```, is there anything I need modify in the RNN architecture? If so, how should I modify it? (5 points)
<font color=blue> 
You should remove the log softmax layer at the output.
</font>



#### Example1: Train a model for counting task
Here we will put every pieces together to train a model. Make sure that ```youDataSampler()``` and ```padding()``` works fine. We introduced ```padding()``` earlier. Complete the function ```generate_initial_hidden()```. And run the experiment.


In [17]:
class exampleRNN(torch.nn.Module):
    def __init__(self):
        super(exampleRNN, self).__init__()
        self.fc0 = torch.nn.Embedding(num_embeddings=5, embedding_dim=4)
        self.rnn_layer = torch.nn.LSTM(input_size=4, hidden_size=10, num_layers=1, bidirectional=False, batch_first=True)
        self.dropout = torch.nn.Dropout(0.4)
        self.fc1 = torch.nn.Linear(10, 5)
        self.softmax = torch.nn.LogSoftmax(dim=-1)

    def forward(self, x, hidden):
        x = self.fc0(x)
        x2, hidden = self.rnn_layer(x, hidden)
        x3 = self.dropout(x2)
        x4 = self.fc1(x3)
        x5 = self.softmax(x4)
        return x5, hidden


def data_generation():
    np.random.seed(0)
    data = []
    for i in range(10000):
        length = i % 10 + 1
        seq = np.zeros(length*2+3, dtype=int)

        seq[0] = 4
        seq[1:length+1] = 1 
        seq[length+1] = 3
        seq[length+2:-1] = 2
        seq[-1] = 0

        input = copy.deepcopy(seq[:-1]).astype(np.int64)
        output = copy.deepcopy(seq[1:]).astype(np.int64)
        data.append((input, output))

    return np.array(data)



In [18]:
# visualize the data
data = data_generation()
data[8]

array([array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype=int64),
       array([1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0],
      dtype=int64)], dtype=object)

In [19]:
# Most of the code are ready
# BUT, there are still several line of code need your work!!

def generate_initial_hidden(batch_size, device="cpu"):
    hidden = []
    # Both h_0 and c_0 should be initialized as zero tensors
    hidden = [torch.zeros((1, batch_size, 10)).to(device), 
              torch.zeros((1, batch_size, 10)).to(device)]
    return hidden

def output_transpose(output):
    output_hat = None
    output_hat = output.transpose(1, 2)
    return output_hat

def char_generation(model, seq, max_len=100, device="cpu"):
    model.eval()
    # seq = np.eye(5, dtype=np.float32)[seq.reshape(-1)]
    input_seq = torch.from_numpy(seq)
    input_seq = input_seq.unsqueeze(0).to(device)
    hidden = [torch.zeros([1, 1, 10]).to(device), 
              torch.zeros([1, 1, 10]).to(device)]
    
    counting = []
    input_char = input_seq
    for i in range(max_len):
        new_char_prob, hidden = model(input_char.to(device), hidden)
        new_char_prob = new_char_prob[:, -1]
        new_char = torch.distributions.categorical.Categorical(logits=new_char_prob).sample() # output is a Long tensor
        new_char = new_char.item() # transform to numpy 
        input_char = torch.zeros(1, 1).type_as(input_char).fill_(new_char)
        counting.append(new_char)
        if new_char == 0:
            return counting
    return counting

In [20]:
simple_model = exampleRNN()
loss_fn = torch.nn.NLLLoss()
optimizer = torch.optim.Adam(simple_model.parameters(), lr=1e-3)

msk = np.random.rand(len(data)) < 0.8
train_data = data[msk]
val_data = data[~msk]

train_dataloader = DataLoader(yourDataset(train_data), batch_size=10, shuffle=True, collate_fn=padding)
val_dataloader = DataLoader(yourDataset(val_data), batch_size=10, shuffle=True, collate_fn=padding)

# if you want change to GPU mode, you just need to let device = "cuda:0" where 0 indicate the first gpu in your computer.
device='cuda' 
simple_model.to(device)
for i in range(20):
    # Sets this simple_model in training mode. Just need one function
    simple_model.train()

    for index, batch in enumerate(train_dataloader):
        input = batch["input"].to(device)
        output = batch["output"].to(device)
        hidden = generate_initial_hidden(batch_size=input.shape[0], device=device)

        simple_model.zero_grad()

        output_hat, _ = simple_model(input, hidden)
        output_hat = output_transpose(output_hat)
        # compute loss
        train_loss = loss_fn(output_hat, output)
        # compute gradient
        train_loss.backward()
        # weights update
        optimizer.step()

    val_loss = []
    # Set the simple_model in evaluation mode. Just need one function.
    simple_model.eval()
    
    for index, batch in enumerate(val_dataloader):
        input = batch["input"].to(device)
        output = batch["output"].to(device)
        hidden = generate_initial_hidden(batch_size=input.shape[0], device=device)

        output_hat, _ = simple_model(input, hidden)
        output_hat = output_transpose(output_hat)
        loss = loss_fn(output_hat, output)
        val_loss.append(loss.item())
        
    print("train loss:", train_loss.item())
    print("eval loss:", np.mean(val_loss))

train loss: 0.23022131621837616
eval loss: 0.16291641082727548
train loss: 0.1874101310968399
eval loss: 0.12775074054646973
train loss: 0.14346323907375336
eval loss: 0.12377344952388243
train loss: 0.1373668760061264
eval loss: 0.11923109977082773
train loss: 0.14196111261844635
eval loss: 0.11707423367735112
train loss: 0.13310842216014862
eval loss: 0.11554426654721751
train loss: 0.12435323745012283
eval loss: 0.11498771009571625
train loss: 0.15274640917778015
eval loss: 0.11442543509783167
train loss: 0.1417730301618576
eval loss: 0.1129363789928682
train loss: 0.12454812973737717
eval loss: 0.11529648808216808
train loss: 0.11909613013267517
eval loss: 0.11313263748330299
train loss: 0.16317789256572723
eval loss: 0.11310458822984888
train loss: 0.1393606811761856
eval loss: 0.1130279762049516
train loss: 0.12153904139995575
eval loss: 0.11255208846896586
train loss: 0.14302580058574677
eval loss: 0.11281618585038666
train loss: 0.12373407930135727
eval loss: 0.1125222338043680

In [21]:
# test, case 1: when n less than 10, which is the maximum length in our training examples
device='cuda'
print(char_generation(simple_model, np.array([4, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

In [18]:
# test, case 2: when n is larger than 10.
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))
print(char_generation(simple_model, np.array([4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]), device=device))

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

#### A useful visualization tool
We just finish the first training task. In this task, we just print out the training loss and evaluation loss. What if we want something else? For example, the output of a specific input, or the curve of training loss while the model is still training. Instead of writting by ourself, we can also use Tensorboard.
Tensorborad is a visualization tool designed by Google. Even though Google only offers example on using tensorflow to interact with tensorboard. Tensorboard can be used by Pytorch or any other deep learning platform. 

Pytorch offers a group of API to saving outputs as a tensorborad readable format. Next cell is an intuitive example of how to draw scaler and output text. 


In [22]:

from torch.utils.tensorboard import SummaryWriter
import numpy as np
import time

# it seems there is some problem on delete directory in colab
# you need to de
if os.path.isdir("runs"):
    shutil.rmtree("runs")

# create a tensorboard writer instance. 
exp_writer = SummaryWriter(flush_secs=10)

%tensorboard --logdir runs
for n_iter in range(100):
    exp_writer.add_scalar('Loss/train', np.random.random(), n_iter)
    exp_writer.add_scalar('Loss/test', np.random.random(), n_iter)
    exp_writer.add_text("train", "{}th iter: hello \n\n".format(n_iter), global_step=n_iter)
exp_writer.close()

Reusing TensorBoard on port 6006 (pid 2352), started 1:08:52 ago. (Use '!kill 2352' to kill it.)

In [23]:
def result_recording(tensorboard_writer, train_loass, test_loss, n_iter):
    tensorboard_writer.add_scalar('Loss/train', train_loss, n_iter)
    tensorboard_writer.add_scalar('Loss/test', test_loss, n_iter)

#### Saving and Loading a trained model
The example in the next cell is how to save and load a trained model's weights. [Here](https://pytorch.org/tutorials/beginner/saving_loading_models.html) offers very detail explaination about how pytorch saves and loads model. 

In [24]:
# saving a trained model's weights to a given path
torch.save(simple_model.state_dict(), "./simple_model.gpu")

# load a trained model's weights from given path. Be Careful about the weights' type difference. 
# As we already seen that cuda_tensor and cpu_tensor are not compatable.
simple_model2 = exampleRNN()
simple_model2.load_state_dict(torch.load("./simple_model.gpu", map_location="cpu"))

<All keys matched successfully>

#### Another task: copy

Counting is a relatively easy task because the pattern is streatforward: except the start symbol and end symbol, $1$s only appears before the delimiter while $2$s appears after the delimiter all the time. The result looks good even without any hyperparameter tunning.

Now, let's try a more complex task: copy. The model takes a sequence, which is a random combination of $1$s and $2$s, as input and it should output the exact same sequence. For example, if the input is "4121223", the output should be "121220", where 4 is the start symble, 0 is the end symble and 3 is the delimiter. Comparing with counting, LSTM needs to learn a more complex pattern in copy. As a consequence, more data is required to help the model summarize the pattern. 

Since we alreay have some experience on the whole pipeline in previous exercise. This time, we will focus on reorganizing the code to make it easier to tune the hyperparameter. After all, it is unlikely that we can get a good set of hyperparameters with first guess. In the next cell, you are required to implement some functions.

##### Complete the code (15 points, 5 for each)

In [76]:
class CharRNN(nn.Module):
    def __init__(self, input_size, embedding_size, lstm_hidden_size, lstm_num_layers, lstm_bidirectional, lstm_dropout, last_dropout):
        '''
        Instead of set fixed hyperparamters, we may want change some hyper parameter while not change any code in the class.
        I already define the input parameters. You are responsible for create DNN layers based on them.
        '''
        super(CharRNN, self).__init__()
        assert isinstance(lstm_bidirectional, bool)
        self.input_size = input_size
        self.embedding_size = embedding_size
        self.lstm_hidden_size = lstm_hidden_size
        self.lstm_num_layers = lstm_num_layers
        self.lstm_bidirectional = lstm_bidirectional
        self.lstm_dropout = lstm_dropout
        self.last_dropout = last_dropout

        #################### your code start ############################
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, 
                            hidden_size=lstm_hidden_size, 
                            num_layers=lstm_num_layers,
                            bidirectional=lstm_bidirectional,
                            dropout=lstm_dropout)
        self.fc = nn.Linear(lstm_hidden_size, input_size) 
        self.dropout = nn.Dropout()
        #################### your code ends ############################

        self.softmax = nn.LogSoftmax(dim=-1) 


    def forward(self, input, initial_hidden):
        
        print("input: ", input.shape)
        print("init hidden: ", initial_hidden)
        
        x = self.embedding(input)
        
        print("x: ", x.shape)
        
        
        x, hidden = self.lstm(x, initial_hidden)

        print("x: ", x.shape)
        print("hidden: ", hidden.shape)
        
        outputs = x
        outputs = self.dropout(outputs)
        outputs = self.fc(outputs)
        outputs = self.softmax(outputs)

        y = outputs 
        
        return y, hidden


class dnn_operations(object):
    def __init__(self, model, loss_fn, optimizer):
        self.model = model
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        

    def train(self, dataloader, device="cpu"):
        ################your code start ########################
        # This function is based on counting task's training process.
        # takes the training dataset as input. 
        # Train the model.
        # And, return average loss overall samples in the dataset
        # DON'T forget set the model to training mode. 


        self.model.train()

        for index, batch in enumerate(dataloader):
            input = batch["input"].to(device)
            output = batch["output"].to(device)
            hidden = generate_initial_hidden(batch_size=input.shape[0], device=device)
            hidden = np.array(hidden)
            self.model.zero_grad()

            output_hat, _ = self.model(input, hidden)
            output_hat = output_transpose(output_hat)
            # compute loss
            train_loss = self.loss_fn(output_hat, output)
            # compute gradient
            train_loss.backward()
            # weights update
            self.optimizer.step()
            
        return self.train_loss.item()



        ##################your code end ######################
        return loss.item()

    def eval(self, dataloader, device="cpu"):
        ################ your code start ########################
        # This function is based on counting task's evaluation process.
        # takes the evaluation dataset as input. 
        # Return the average loss overall samples in the dataset
        # DON'T forget set the model to evaluation mode. 

        val_loss = []
        # Set the simple_model in evaluation mode. Just need one function.
        self.model.eval()

        for index, batch in enumerate(val_dataloader):
            input = batch["input"].to(device)
            output = batch["output"].to(device)
            hidden = generate_initial_hidden(batch_size=input.shape[0], device=device)

            output_hat, _ = self.model(input, hidden)
            output_hat = output_transpose(output_hat)
            loss = self.loss_fn(output_hat, output)
            val_loss.append(loss.item())
        
        ##################your code end ######################
        return np.mean(val_loss)

    def generate_initial_hidden(self, batch_size, device):

        initial_hidden = [torch.zeros([self.model.lstm_num_layers * (int(self.model.lstm_bidirectional) + 1), batch_size, self.model.lstm_hidden_size]).to(device), 
                          torch.zeros([self.model.lstm_num_layers * (int(self.model.lstm_bidirectional) + 1), batch_size, self.model.lstm_hidden_size]).to(device)]
        return np.array(initial_hidden)

    def char_generation(self, seq, max_len=100, device="cpu"):
        self.model.eval()
        input_seq = torch.from_numpy(seq)
        input_seq = input_seq.unsqueeze(0).to(device)
        hidden = self.generate_initial_hidden(1, device=device)

        outputs = []
        input_char = input_seq
        for i in range(max_len):
            new_char_prob, hidden = self.model(input_char.to(device), hidden)
            new_char_prob = new_char_prob[:, -1]
            new_char = torch.distributions.categorical.Categorical(logits=new_char_prob).sample() # output is a Long tensor
            new_char = new_char.item() # transform to numpy 
            input_char = torch.zeros(1, 1).type_as(input_char).fill_(new_char)
            outputs.append(new_char)

            if new_char == 0:
                return outputs
        return outputs
        

In [77]:
def data_generation_forcoping(num_samples=10000):
    np.random.seed(0)
    data = []
    for i in range(num_samples):
        length = i % 10 + 1
        seq = np.zeros(length*2+3, dtype=int)

        temp = np.random.randint(1,3, length)
        seq[0] = 4
        seq[1:length+1] = copy.deepcopy(temp)
        seq[length+1] = 3
        seq[length+2:-1] = copy.deepcopy(temp)
        seq[-1] = 0

        input = copy.deepcopy(seq[:-1]).astype(np.int64)
        output = copy.deepcopy(seq[1:]).astype(np.int64)
        data.append((input, output))

    return np.array(data)

In [78]:
data_coping = data_generation_forcoping(num_samples=100000)
data_coping[109]

array([array([4, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2],
      dtype=int64),
       array([1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 0],
      dtype=int64)], dtype=object)

In [79]:
# model initialization
simple_model_coping = CharRNN(input_size=5, 
                              embedding_size=4, 
                              lstm_hidden_size=14, 
                              lstm_num_layers=1, 
                              lstm_bidirectional=False, 
                              lstm_dropout=0, 
                              last_dropout=0.5)

loss_fn_coping = torch.nn.NLLLoss()
optimizer_coping = torch.optim.Adam(simple_model_coping.parameters(), lr=1e-3)

# if you want change to GPU mode, you just need to let device = "cuda:0" where 0 indicate the first gpu in your computer.
device='cuda' 
simple_model_coping.to(device)

operations_for_copy = dnn_operations(model=simple_model_coping, loss_fn=loss_fn_coping, optimizer=optimizer_coping)

# loading data
msk_coping = np.random.rand(len(data_coping)) < 0.8
train_data_coping = data_coping[msk_coping]
val_data_coping = data_coping[~msk_coping]

train_dataloader = DataLoader(yourDataset(train_data_coping), batch_size=20, shuffle=True, collate_fn=padding)
val_dataloader = DataLoader(yourDataset(val_data_coping), batch_size=20, shuffle=True, collate_fn=padding)


# create a tensorboard writer instance. 
#writer_coping = SummaryWriter(flush_secs=10)

#%tensorboard --logdir runs
    
for i in range(20):
    print(i)
    train_loss = operations_for_copy.train(dataloader=train_dataloader, device=device)
    eval_loss = operations_for_copy.eval(dataloader=train_dataloader, device=device)
    #result_recording(writer_coping, train_loss, eval_loss, i)


0
input:  torch.Size([20, 22])
init hidden:  [tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0.

RuntimeError: Expected hidden[0] size (1, 22, 14), got (1, 20, 10)

In [None]:
# all the mode
print(operations_for_copy.char_generation(np.array([4, 1, 2, 2, 1, 2, 1, 3]), device=device))
print(operations_for_copy.char_generation(np.array([4, 2, 1, 2, 2, 2, 2, 1, 3]), device=device))
print(operations_for_copy.char_generation(np.array([4, 2, 2, 1, 2, 1, 2, 2, 1, 3]), device=device))


In [None]:
# sequence longer than 10
print(operations_for_copy.char_generation(np.array([2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 3]), device=device))
print(operations_for_copy.char_generation(np.array([2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 3]), device=device))
print(operations_for_copy.char_generation(np.array([2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 3]), device=device))
print(operations_for_copy.char_generation(np.array([2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 3]), device=device))

##### Some extra experiments (15 points)
<font color=red>
Copy the experiment to new cells, but change the hyperparameters of the model. You can changes the lstm hidden size to be 10 or 20. OR, you can use more than one lstm layers. OR, some other hyperparameters you find interesting. 
Report your findings. Hopefully, this experiment should offer you some experience on tuning the hyperparameters.
</font>


<font color=blue>
Your report
</font>

## CharRNN on generation (15 points in total)

### CharRNN
Basically, CharRNN is just a regular recurrent neural network. One can use any rnn layer to build this network, e.g. GRU, LSTM. The word "Char" indicates that the input unit is a single character, like 'a', '\n', '$\alpha$'. Using it for code generation is a perfect example to show how LSTM can learn some long term dependencies while it does not simply memorise every thing. There are other cool experiments about CharRNN in [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). 

### Task (15 points)
This section is more focus on experiments. I will offer you a compelet example on training a CharRNN on C++ code generation. Your task is train a model with some other data. You can modify the code that I offers. Tune the hyperparameters might be a necessory step. [Here are some interesting data you can try](https://cs.stanford.edu/people/karpathy/char-rnn/). This is also where I download the linux code. Working on some small size data is highly recommended.

Report your training result. Can you find some clue that can distinguish the real text and generated text?
<font color=blue>
your report
</font>

### Prepare Loading Dataset

In [None]:
!mkdir linux_code
!wget https://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt -O linux_code/linux_input.txt
!ls linux_code
filepath="linux_code"

A subdirectory or file linux_code already exists.


In [None]:
def generate_data(dirpath, segment_length=36):
    """
    this function will load the text file 
    and generate text segments with given length limitation
   
    parameters:
        dirpath: path to the text file
        segment_length: length of the text segment
    Outputs:
        1. A numpy array of tuples. In a tuple, the first element contains 
           the first segment_length-1 of characters, the last contains the 
           last segment_length-1 of characters. Like this example (segment_length=100):   
           ('/*\n * linux/kernel/irq/autoprobe.c\n *\n * Copyright (C) 1992, 1998-2004 Linus Torvalds, Ingo Molnar\n',
            '*\n * linux/kernel/irq/autoprobe.c\n *\n * Copyright (C) 1992, 1998-2004 Linus Torvalds, Ingo Molnar\n ')
        2. the dictionary for mapping each character to a unique index. 
        3. the reversed dictionary that mapping the index to character
    """

    data = None
    char2index = None
    index2char = None
    filenames = []
    for _,_,fs in os.walk(dirpath):
        filenames.extend(fs)
    
    data = []
    chars = set([])
    for fname in filenames:
        with open(os.path.join(dirpath, fname), 'r', encoding="utf8", errors='ignore') as f:
            text = f.read(segment_length)
            while text:
                input = text[:-1]
                output = text[1:]
                data.append((input, output))
                chars |= set(text)
                text = f.read(segment_length)

    data = np.array(data)
    char_list = sorted(list(chars))
    char2index = {char:index for index, char in enumerate(char_list)}
    index2char = {v:k  for k, v in char2index.items()}
    assert isinstance(data, np.ndarray)
    for i, c in index2char.items():
        assert char2index[c] == i 
    return data, char2index, index2char

In [None]:
# visualize your output. Make sure the output is correct
data, char2index, index2char = generate_data(filepath, segment_length=100)
data[110]

In [None]:
class TextDataset(Dataset):
    def __init__(self, data, char2index):
        """
        Input parameters
            data: a list of tuple. The first element in a tuple should be the input for the DNN, the second one should be the output
            char2index: map characters to index
        """
        self.data = data
        self.char2index = char2index

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        instance = self.data[idx]
        input = None
        output = None
        # transform selected text into the index list

        input  = np.array([self.char2index[i] for i in instance[0]], dtype=np.int64) 
        output = np.array([self.char2index[i] for i in instance[1]], dtype=np.int64)
        
        assert isinstance(input, np.ndarray)
        assert isinstance(output, np.ndarray)
        sample = {'input' : input, 
                  'output': output}
        return sample
        

In [None]:
# visualize your output. Make sure the output is correct
test_text = TextDataset(data=data, char2index=char2index)
test_text.__getitem__(110)

In [None]:

code_dataset = TextDataset(data, char2index=char2index)
# we can directly reuse the padding function which is implemented in last section.
dataloader = DataLoader(code_dataset, batch_size=2,
                        shuffle=True, num_workers=1,
                        collate_fn=padding)
for index, batch in enumerate(dataloader):
    print(batch["input"])
    print(batch["output"])
    break

### Build the neural net architecture





In [None]:
def text_generation(model, start_seq, char2index, index2char, max_len=100):
    model.eval()
    input_seq = np.array([char2index[i] for i in start_seq])
    input_seq = torch.from_numpy(input_seq).long().to(device)
    input_seq = torch.unsqueeze(input_seq, 0)
    output_seq = copy.deepcopy(input_seq)
    batch_size = input_seq.shape[0]
    hidden = [torch.zeros([num_layers * (int(bidirectional) + 1), batch_size, hidden_size]).to(device), 
              torch.zeros([num_layers * (int(bidirectional) + 1), batch_size, hidden_size]).to(device)]

    for i in range(max_len):
        new_char_prob, hidden = model(input_seq, hidden)
        new_char_prob = new_char_prob[:, -1]
        new_char = torch.distributions.categorical.Categorical(logits=new_char_prob).sample() # output is a Long tensor
        new_char = new_char.item() # transform to numpy 
        input_seq = torch.ones(1, 1).type_as(input_seq).fill_(new_char)
        output_seq = torch.cat([output_seq, input_seq], dim=1)
    
    generate_text = "".join([index2char[i] for i in output_seq.cpu().numpy()[0]])

    assert type(generate_text) is str
    return generate_text

def record_generate_text(tensorflow_writer, epoch, model, start_seq, char2index, index2char, max_len=100):
    generate_text = text_generation(model, start_seq, char2index, index2char, max_len)
    tensorflow_writer.add_text("{}th epoch".format(epoch),
                              "".join([i if i != "/n" else "/n/n" for i in generate_text]))

In [None]:
num_layers = 3
hidden_size=512
embedding_size=256
bidirectional = False
dropout=0.7
epochs = 50
test_freq = 1000
batch_size=64

model_code = CharRNN(input_size=len(char2index), 
                     embedding_size=embedding_size, 
                     lstm_hidden_size=hidden_size, 
                     lstm_num_layers=num_layers, 
                     lstm_bidirectional=bidirectional, 
                     lstm_dropout=dropout, 
                     last_dropout=0.5)

critiria = nn.NLLLoss()
optimizer = torch.optim.Adam(model_code.parameters(), lr=1e-3)


In [None]:
# create a tensorboard writer instance. 
writer_code = SummaryWriter(flush_secs=10)

# 
data, char2index, index2char = generate_data(filepath, segment_length=100)
msk = np.random.rand(len(data)) < 0.8
train = data[msk]
validation = data[~msk]

train_set      = TextDataset(train, char2index=char2index)
validation_set = TextDataset(validation, char2index=char2index)

train_loader = DataLoader(train_set, batch_size=batch_size,
                          shuffle=True, num_workers=6, 
                          collate_fn=padding)

validation_loader = DataLoader(validation_set, batch_size=batch_size,
                               shuffle=True, num_workers=6, 
                               collate_fn=padding)
    
# if you want change to GPU mode, you just need to let device = "cuda:0" where 0 indicate the first gpu in your computer.
device='cuda' 
model_code.to(device)
operations_for_code = dnn_operations(model=model_code, loss_fn=critiria, optimizer=optimizer)
for epoch in range(20):
    train_loss = operations_for_code.train(dataloader=train_loader, device=device)
    eval_loss = operations_for_code.eval(dataloader=validation_loader, device=device)
    result_recording(writer_code, train_loss, eval_loss, epoch)
    record_generate_text(writer_code, epoch, operations_for_code.model, "#include", char2index, index2char, max_len=400)

In [None]:
generate_text = text_generation(operations_for_code.model, "#include", char2index, index2char, max_len=10000)
print("".join([i if i != "/n" else "/n/n" for i in generate_text]))