<center><h2>ALTeGraD 2022<br>Lab Session 1: HAN</h2><h3>Hierarchical Attention Network Using GRU</h3> 27 / 10 / 2022<br> M. Kamal Eddine, H. Abdine<br><br>


<b>Student name:</b> Tom SALEMBIEN


</center>
In this lab, you will get familiar with recurrent neural networks (RNNs), self-attention, and the HAN architecture <b>(Yang et al. 2016)</b> using PyTorch. In this architecture, sentence embeddings are first individually produced, and a document embedding is then computed from the sentence embeddings.<br>
<b>The deadline for this lab is November 14, 2022 11:59 PM.</b> More details about the submission and the architecture for this lab can be found in the handout PDF.


### = = = = =  Attention Layer = = = = =
In this section, you will fill the gaps in the code to implement the self-attention layer. This layer will be used later to define the HAN architecture. The basic idea behind attention is that rather than considering the last annotation $h_T$ as a summary of the entire sequence, which is prone to information loss, the annotations at <i>all</i> time steps are used.
The self-attention mechanism computes a weighted sum of the annotations, where the weights are determined by trainable parameters. Refer to <b>section 2.2</b> in the handout for the theory; you will need it to complete the first task.
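As a minimal illustration of the weighted sum described above, the sketch below uses plain Python lists instead of tensors. The helper name `attention_pool` and the toy scores are assumptions made for illustration only; in the actual layer, the scores are produced by trainable parameters rather than supplied by hand.

```python
import math

def attention_pool(annotations, scores):
    """Softmax-normalise one score per time step and return the
    weighted sum of the annotations (hypothetical helper)."""
    # Subtract the largest score before exponentiating for numerical stability.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_features = len(annotations[0])
    # Weighted sum over time steps, computed feature by feature.
    pooled = [
        sum(w * h[j] for w, h in zip(weights, annotations))
        for j in range(n_features)
    ]
    return pooled, weights

# Three time steps with two features each; the scores are made up.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
pooled, weights = attention_pool(annotations, scores)
```

With equal scores, every time step receives the same weight, so the result is simply the mean of the annotations. Unequal scores shift the summary towards the time steps with the highest scores.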

#### <b>Task 1:</b>

In [1]:
!nvidia-smi

Mon Dec 26 23:57:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    36W / 250W |    757MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   35C    P0    42W / 250W |  29925MiB / 32510MiB |      0%      Default |
|       

In [2]:
import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.nn.functional as F

class AttentionWithContext(nn.Module):
    """
    Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    """
    
    def __init__(self, input_shape, return_coefficients=False, bias=True):
        super(AttentionWithContext, self).__init__()
        self.return_coefficients = return_coefficients

        self.W = nn.Linear(input_shape, input_shape, bias=bias)
        self.tanh = nn.Tanh()
        self.u = nn.Linear(input_shape, 1, bias=False)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.W.weight.data.uniform_(-initrange, initrange)
        self.W.bias.data.uniform_(-initrange, initrange)
        self.u.weight.data.uniform_(-initrange, initrange)
    
    def generate_square_subsequent_mask(self, sz):
        # causal mask: 0.0 on and below the diagonal, -inf above it
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = (
            mask.float()
            .masked_fill(mask == 0, float("-inf"))
            .masked_fill(mask == 1, float(0.0))
        )
        return mask
    
    def forward(self, x, mask=None):
        #print("AttentionWithContext in :", x.size())
        uit = self.W(x)
        uit = self.tanh(uit)
        ait = self.u(uit)
        a = torch.exp(ait)
        
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            a = a*mask.double()
        
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        eps = 1e-9
        a = a / (torch.sum(a, axis=1, keepdim=True) + eps)
        weighted_input = torch.sum(a*x, dim=1)
        #print("AttentionWithContext out :", weighted_input.size(), torch.sum(weighted_input, dim=1).size())
        if self.return_coefficients:
            return  [weighted_input, a] 
        else:
            return  weighted_input

### = = = = = Parameters = = = = =
In this section, we define the parameters used for training, such as the data path, the embedding dimension <b>d</b>, the GRU layer dimensionality <b>n_units</b>, etc.<br>
The <b>device</b> parameter is used to train the model on a GPU if one is available. If you are using Google Colab, switch to a GPU runtime to train the model as fast as possible.<br>
<b>Bonus question:</b> What is the purpose of the parameter <i>my_patience</i>?

The <i>my_patience</i> parameter controls early stopping, a regularization technique that limits overfitting: training stops once the monitored metric (typically the validation loss or accuracy) has failed to improve for <i>my_patience</i> consecutive epochs.
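The early-stopping rule can be sketched in a few lines of plain Python. The helper name `early_stopping_epoch` and the validation losses below are made up for illustration; this is one common way to implement a patience counter, not necessarily the one expected by the handout.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None
    if the patience budget is never exhausted (hypothetical helper)."""
    best = float("inf")
    waited = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited == patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2.
val_losses = [0.90, 0.80, 0.85, 0.82, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here the counter is reset every time a new best value is observed, so a small patience can stop training before a later improvement (epoch 5 in this toy series) is ever reached.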

In [3]:
import sys
import json
import operator
import numpy as np

path_root = ''
path_to_data = path_root + 'data/'

d = 20 # dimensionality of amino acid embeddings
n_units = 100 # RNN layer dimensionality
drop_rate = 0.3 # dropout
#input_size = (4888, 989, 20)
input_size = (4888, 8466)
padding_idx = 0
oov_idx = 1
batch_size = 32
nb_epochs = 10
my_patience = 2 # for early stopping strategy
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device :", device)

Device : cuda


### = = = = = Data Loading = = = = =
In this section, we first use <b>wget</b> to download the data, then load it with NumPy in the first cell. In the second cell, we use this data to define our PyTorch data loader. Note that the data is already preprocessed, tokenized and padded.<br><br>
<b>Note: if you are running your notebook on Windows or on MacOS, <i>wget</i> will probably not work if you did not install it manually. In this case, use the provided link to download the data and change the <i>path_to_data</i> in the <i>Parameters</i> section accordingly. Otherwise, you will face no problem on Ubuntu and Google Colab.</b>
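To make the term "padded" concrete, here is a small sketch of what padding index sequences to a common length looks like. The helper `pad_sequences` and the toy indices are assumptions for illustration; the value 0 matches the `padding_idx` defined in the <i>Parameters</i> section.

```python
def pad_sequences(sequences, padding_idx=0):
    """Right-pad every sequence with padding_idx up to the longest length
    (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_idx] * (max_len - len(seq)) for seq in sequences]

# Three already tokenized sequences of different lengths (made-up indices).
encoded = [[3, 7, 2], [5], [9, 4]]
padded = pad_sequences(encoded)
```

After padding, all rows have the same length, which is what allows them to be stacked into a single tensor and batched.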

#### <b>Task 2.1:</b>

In [3]:
# Load files for ohe
graph_indicator = np.loadtxt("graph_indicator.txt", dtype=np.int64)
nodes = np.loadtxt("node_attributes.txt", delimiter=",")

In [4]:
# Read sequences
sequences = list()
with open('sequences.txt', 'r') as f:
    for line in f:
        sequences.append(line[:-1])

# Split data into training and test sets
sequences_train = list()
sequences_test = list()
train_ohe = list()
test_ohe = list()
proteins_test = list()
y_train = list()
with open('graph_labels.txt', 'r') as f:
    for i,line in enumerate(f):
        t = line.split(',')
        ohe_vec = torch.Tensor([node[3:23] for node in nodes[np.where(graph_indicator==i)]])
        if len(t[1][:-1]) == 0:
            proteins_test.append(t[0])
            sequences_test.append(sequences[i])
            test_ohe.append(ohe_vec)
            
        else:
            sequences_train.append(sequences[i])
            y_train.append(int(t[1][:-1]))
            train_ohe.append(ohe_vec)



train_ohe = nn.utils.rnn.pad_sequence(train_ohe).permute(1, 0, 2).long()
test_ohe = nn.utils.rnn.pad_sequence(test_ohe).permute(1, 0, 2).long()
pad_ = (0, 0, 0, 79)
test_ohe = F.pad(test_ohe, pad_, "constant", 0)
y_train = F.one_hot(torch.Tensor(y_train).long())



In [5]:
print(train_ohe.size())
print(test_ohe.size())
print(y_train.size())

torch.Size([4888, 989, 20])
torch.Size([1223, 989, 20])
torch.Size([4888, 18])


In [40]:
print(len(np.unique([x[:9] for x in sequences_train])))

4686


In [8]:
amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

def create_dicts(sequence=amino_acids):
    """
    Create the dicts for the sequence embedding
    """
    word_to_index = dict(zip(sequence, range(1,21)))
    # invert mapping
    index_to_word =  {v : k for k, v in word_to_index.items()}
    return word_to_index, index_to_word

word_to_index, index_to_word = create_dicts()

In [27]:
import csv
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer

def sparse_mx_to_torch_sparse_tensor(sparse_mx):
    """
    Function that converts a Scipy sparse matrix to a sparse Torch tensor
    """
    sparse_mx = sparse_mx.tocoo().astype(np.int_)
    indices = torch.from_numpy(np.vstack((sparse_mx.row, sparse_mx.col)).astype(np.int64))
    values = torch.from_numpy(sparse_mx.data)
    shape = torch.Size(sparse_mx.shape)
    return torch.sparse.FloatTensor(indices, values, shape)


# Read sequences
sequences = list()
with open('sequences.txt', 'r') as f:
    for line in f:
        sequences.append(line[:-1])

# Split data into training and test sets
sequences_train = list()
sequences_test = list()
proteins_test = list()
y_train = list()
with open('graph_labels.txt', 'r') as f:
    for i,line in enumerate(f):
        t = line.split(',')
        if len(t[1][:-1]) == 0:
            proteins_test.append(t[0])
            sequences_test.append(sequences[i])
        else:
            sequences_train.append(sequences[i])
            y_train.append(int(t[1][:-1]))

# Map sequences to TF-IDF vectors over character n-grams (n = 1 to 3)
vec = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
X_train = vec.fit_transform(sequences_train).todense()
X_test = vec.transform(sequences_test).todense()
y_train = F.one_hot(torch.Tensor(y_train).long())
print(X_train.shape, X_test.shape, y_train.size())

(4888, 8466) (1223, 8466) torch.Size([4888, 18])


In [None]:
### OHE TESTING WITHOUT THE NODE_ATTRIBUTES AND GRAPH_INDICATOR FILES ########

# def encode_pad(dataset, word_to_index = word_to_index):
#     """
#     Encoding and padding of the amino acids
#     """
#     encode = []
#     for row in dataset:
#         row_encode = []
#         for aa in row:
#             row_encode.append(word_to_index.get(aa))
#         encode.append(torch.Tensor(row_encode))
    
#     # Padding
#     encode.append(torch.ones(989))
#     encode = nn.utils.rnn.pad_sequence(encode)
#     encode = torch.transpose(encode, 0, 1)
#     return encode[:-1]

# train_encode = encode_pad(sequences_train).long()
# test_encode = encode_pad(sequences_test).long()

# print("Size : nbreview*nb_sentence*sent_size")
# sizes = [train_encode, y_train, test_encode, proteins_test]
# for x in sizes:
#     print(np.shape(x), np.shape(x[0]))
    
# train_ohe = F.one_hot(train_encode)
# test_ohe = F.one_hot(test_encode)

# print(train_ohe.size(), test_ohe.size())

In [16]:
import numpy
import torch
from torch.utils.data import DataLoader, Dataset


class Dataset_(Dataset):
    def __init__(self, x, y):
        self.documents = x
        self.labels = y

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, index):
        document = self.documents[index]
        label = self.labels[index] 
        sample = {
            "document": torch.tensor(document),
            "label": torch.tensor(label),
            }
        return sample


def get_loader(x, y, batch_size=32):
    dataset = Dataset_(x, y)
    data_loader = DataLoader(dataset=dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            drop_last=True,
                            )
    return data_loader

### = = = = = Defining Architecture = = = = =
In this section, we define the HAN architecture. We start with the <i>AttentionBiGRU</i> module, which defines the sentence encoder (see Figure 3 in the handout). Then, we define the <i>TimeDistributed</i> module, which forwards our input (a batch of documents) to the sentence encoder as a <b>batch of sentences</b>, where each sentence in a document is treated as a time step. This module also reshapes the encoder output back into one sequence of sentence representations per document. Finally, we define the <b>HAN</b> architecture using <i>TimeDistributed</i>, <i>AttentionWithContext</i> and <i>GRU</i>.
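The reshaping performed around the sentence encoder can be illustrated with nested Python lists. The helpers `flatten_documents` and `unflatten`, the toy batch, and the stand-in "encoder" (a simple sum) are all assumptions for illustration; the real module performs the same bookkeeping on tensors with `view`.

```python
def flatten_documents(batch):
    """(docs, sentences, words) -> (docs * sentences, words)."""
    return [sentence for document in batch for sentence in document]

def unflatten(rows, n_docs):
    """(docs * sentences, features) -> (docs, sentences, features)."""
    per_doc = len(rows) // n_docs
    return [rows[i * per_doc:(i + 1) * per_doc] for i in range(n_docs)]

# Two documents, three sentences each, four word indices per sentence.
batch = [[[d * 100 + s * 10 + w for w in range(4)] for s in range(3)]
         for d in range(2)]
flat = flatten_documents(batch)  # six sentences in a single batch
# Stand-in for the sentence encoder: one feature per sentence.
encoded = [[sum(sentence)] for sentence in flat]
restored = unflatten(encoded, n_docs=len(batch))
```

After the round trip, `restored[d][s]` holds the representation of sentence `s` in document `d`, which is the layout the document-level GRU expects.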

#### <b>Task 2.2:</b>

In [31]:

class AttentionBiGRU(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=drop_rate):
        super(AttentionBiGRU, self).__init__()
        self.embedding = nn.Embedding(len(index_to_word)+2,# fill the gap # vocab size
                                      d, # dimensionality of embedding space
                                      padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(input_size=d,
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2*n_units,   # fill the gap # the input shape for the attention layer
                                              return_coefficients=True)


    def forward(self, sent_ints):
        #print("AttentionBiGru in :", sent_ints.size())
        sent_wv = self.embedding(sent_ints)
        #print(sent_wv.size())
        sent_wv_dr = self.dropout(sent_wv)
        sent_wa, _ =  self.gru(sent_wv_dr)# fill the gap # RNN layer
        sent_att_vec, word_att_coeffs = self.attention(sent_wa) # fill the gap # attentional vector for the sent
        sent_att_vec_dr = self.dropout(sent_att_vec)  
        #print("AttentionBiGru out :", sent_att_vec_dr.size())   
        return sent_att_vec_dr, word_att_coeffs

class TimeDistributed(nn.Module):
    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):
        if len(x.size()) <= 2:
            return self.module(x)
        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size) (448, 30)
        #print("Time distributed", x_reshape.size())
        sent_att_vec_dr, word_att_coeffs = self.module(x_reshape)
        # We have to reshape the output
        if self.batch_first:
            sent_att_vec_dr = sent_att_vec_dr.contiguous().view(x.size(0), -1, sent_att_vec_dr.size(-1))  # (samples, timesteps, output_size)
            word_att_coeffs = word_att_coeffs.contiguous().view(x.size(0), -1, word_att_coeffs.size(-1))  # (samples, timesteps, output_size)
        else:
            sent_att_vec_dr = sent_att_vec_dr.view(-1, x.size(1), sent_att_vec_dr.size(-1))  # (timesteps, samples, output_size)
            word_att_coeffs = word_att_coeffs.view(-1, x.size(1), word_att_coeffs.size(-1))  # (timesteps, samples, output_size)
        return sent_att_vec_dr, word_att_coeffs      

class HAN(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=0):
        super(HAN, self).__init__()
        self.encoder = AttentionBiGRU(input_shape, n_units, index_to_word, dropout)
        self.timeDistributed = TimeDistributed(self.encoder, True)
        self.dropout = nn.Dropout(drop_rate)
        self.gru = nn.GRU(input_size=2*n_units,# fill the gap # the input shape of GRU layer
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2*n_units, # fill the gap # the input shape of between-sentence attention layer
                                              return_coefficients=True)
        self.lin_out = nn.Linear(2*n_units,   # fill the gap # the input size of the last linear layer
                                 18)
        self.preds = nn.Sigmoid()

    def forward(self, doc_ints):
        #print('HAN to time distrib', doc_ints.size())
        sent_att_vecs_dr, word_att_coeffs = self.timeDistributed(doc_ints.to(device).long())
        #print('Time Distrib to gru', sent_att_vecs_dr.size())
        doc_sa, _ = self.gru(sent_att_vecs_dr)
        #print('GRU to attention', doc_sa.size())
        doc_att_vec, sent_att_coeffs = self.attention(doc_sa)
        #print('Attention to lin_out', doc_att_vec.size())
        doc_att_vec_dr = self.dropout(doc_att_vec)
        doc_att_vec_dr = self.lin_out(doc_att_vec_dr)
        #print("lin_out", doc_att_vec_dr.size())
        return self.preds(doc_att_vec_dr), word_att_coeffs, sent_att_coeffs


### = = = = = Training = = = = =
In this section, we have two code cells. In the first one, we define our evaluation function to compute the training and validation accuracies. In the second one, we define our model, loss and optimizer, and train the model over <i>nb_epochs</i>.<br>
<b>Bonus task:</b> use <a href="https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html" target="_blank">tensorboard</a> to visualize the loss and the validation accuracy during training.
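An accuracy computation of the kind mentioned above can be sketched as follows, assuming one-hot labels and one vector of class scores per sample. The function name `accuracy` and the toy values are made up for illustration and use plain lists rather than tensors.

```python
def accuracy(probabilities, one_hot_labels):
    """Fraction of samples whose highest-scoring class matches the label
    (hypothetical helper)."""
    correct = 0
    for probs, label in zip(probabilities, one_hot_labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        target = label.index(1)
        correct += int(predicted == target)
    return correct / len(probabilities)

# Three samples, three classes; the scores are made up.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
labels = [[1, 0, 0], [0, 0, 1], [1, 0, 0]]
acc = accuracy(probs, labels)
```

In this toy example the first two samples are classified correctly and the third is not, so the accuracy is two thirds.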

#### <b>Task 2.3:</b>

In [33]:
from tqdm import tqdm

model = HAN(input_size, n_units, index_to_word).to(device)
model = model.double()
lr = 0.001  # learning rate
# The template suggests binary cross-entropy from torch.nn: https://pytorch.org/docs/stable/nn.html#loss-functions
# Note: nn.CrossEntropyLoss expects raw logits, whereas HAN.forward returns sigmoid outputs.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr) #fill me

def train(x_train=X_train,
          y_train=y_train,
          x_test=X_test,
          word_dict=index_to_word,
          batch_size=batch_size):
  
    train_data = get_loader(x_train, y_train, batch_size)

    best_loss = np.inf
    p = 0 # patience

    for epoch in range(1, nb_epochs + 1): 
        losses = []
        accuracies = []
        with tqdm(train_data, unit="batch") as tepoch:
            for idx, data in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}")
                model.train()
                optimizer.zero_grad()
                input = data['document'].to(device)
                label = data['label'].to(device)
                label = label.double()
                output = model.forward(input)[0]
                loss = criterion(output, label) # fill the gap # compute the loss
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) # prevent exploding gradient 
                optimizer.step()

                losses.append(loss.item())

        print("===> Epoch {} Complete: Avg. Loss: {:.4f}"
              .format(epoch, sum(losses)/len(losses)))
        train_loss = sum(losses)/len(losses)
        if train_loss <= best_loss:
            best_loss = train_loss
            print("Train Loss improved, saving model...")
            torch.save(model.state_dict(), './best_model.pt')
            p = 0
#         else:
#             p += 1
#             if p==my_patience:
#                 print("Validation accuracy did not improve for {} epochs, stopping training...".format(my_patience))
#     print("Loading best checkpoint...")    
#     model.load_state_dict(torch.load('./best_model.pt'))
#     model.eval()
    print('done.')

train()

Epoch 1: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 1 Complete: Avg. Loss: 2.5846
Train Loss improved, saving model...


Epoch 2: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 2 Complete: Avg. Loss: 2.5474
Train Loss improved, saving model...


Epoch 3: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 3 Complete: Avg. Loss: 2.5471
Train Loss improved, saving model...


Epoch 4: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 4 Complete: Avg. Loss: 2.5477


Epoch 5: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 5 Complete: Avg. Loss: 2.5467
Train Loss improved, saving model...


Epoch 6: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 6 Complete: Avg. Loss: 2.5469


Epoch 7: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 7 Complete: Avg. Loss: 2.5461
Train Loss improved, saving model...


Epoch 8: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 8 Complete: Avg. Loss: 2.5467


Epoch 9: 100%|█████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]


===> Epoch 9 Complete: Avg. Loss: 2.5476


Epoch 10: 100%|████████████████████████████████████████████████████████████████████| 152/152 [01:50<00:00,  1.38batch/s]

===> Epoch 10 Complete: Avg. Loss: 2.5460
Train Loss improved, saving model...
done.





### Test Model


In [34]:
model.load_state_dict(torch.load("./best_model.pt"))
model.eval()

HAN(
  (encoder): AttentionBiGRU(
    (embedding): Embedding(22, 20, padding_idx=0)
    (dropout): Dropout(p=0, inplace=False)
    (gru): GRU(20, 100, batch_first=True, bidirectional=True)
    (attention): AttentionWithContext(
      (W): Linear(in_features=200, out_features=200, bias=True)
      (tanh): Tanh()
      (u): Linear(in_features=200, out_features=1, bias=False)
    )
  )
  (timeDistributed): TimeDistributed(
    (module): AttentionBiGRU(
      (embedding): Embedding(22, 20, padding_idx=0)
      (dropout): Dropout(p=0, inplace=False)
      (gru): GRU(20, 100, batch_first=True, bidirectional=True)
      (attention): AttentionWithContext(
        (W): Linear(in_features=200, out_features=200, bias=True)
        (tanh): Tanh()
        (u): Linear(in_features=200, out_features=1, bias=False)
      )
    )
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (gru): GRU(200, 100, batch_first=True, bidirectional=True)
  (attention): AttentionWithContext(
    (W): Linear(in_features=200

In [38]:
with torch.no_grad():
    # Keep the original order (no shuffling) so that row i of y_pred matches proteins_test[i]
    dataloader = DataLoader(dataset=X_test,
                            batch_size=1,
                            shuffle=False,
                            pin_memory=True,
                            drop_last=False,
                            )
    y_pred = torch.zeros(1,18).to(device)
    for idx, data in enumerate(dataloader):
        y_pred = torch.cat([y_pred, model(data.to(device))[0]], 0)
    y_pred = y_pred[1:]


In [39]:
import csv
# Write predictions to a file
with open('sample_submission_han.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    lst = list()
    for i in range(18):
        lst.append('class'+str(i))
    lst.insert(0, "name")
    writer.writerow(lst)
    for i, protein in enumerate(proteins_test):
        lst = y_pred[i].tolist()
        lst.insert(0, protein)
        writer.writerow(lst)