## Set-up environment

Let's first install the required libraries:
* HuggingFace Transformers (for the CodeT5 model)
* HuggingFace Datasets (for loading the dataset + preprocessing it)
* PyTorch Lightning (for training)
* Weights and Biases (for logging training metrics).

In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

ModuleNotFoundError: No module named 'google.colab'

In [2]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [3]:
!nvidia-smi


Sat Nov  9 14:24:12 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    57W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
%cd /content/drive/My Drive/workspace/DeepCom

[Errno 2] No such file or directory: '/content/drive/My Drive/workspace/DeepCom'
/media/lhbac07/DeepCom


  bkms = self.shell.db.get('bookmarks', {})


In [None]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
#!unzip '/content/drive/MyDrive/datasetv2.zip' -d '/content/drive/MyDrive/workspace/'

Archive:  /content/drive/MyDrive/datasetv2.zip
   creating: /content/drive/MyDrive/workspace/dataset_v2/
   creating: /content/drive/MyDrive/workspace/dataset_v2/original/
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger.json  
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/spring-security.json  
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/thredds.json  
   creating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/train_transfer.code  
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/valid_transfer.code  
 extracting: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/test.comment  
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/train.comment  
  inflating: /content/drive/MyDrive/workspace/dataset_v2/original/dagger/train_transfer_all.comment  
  inflating: /content/drive/MyDrive

## Preprocess data

Here, we load the "code_to_text" portion of the [CodeXGLUE](https://microsoft.github.io/CodeXGLUE/) dataset. As you may know, GLUE [(Wang et al., 2018)](https://arxiv.org/abs/1804.07461) is a famous benchmark in NLP, which led to a lot of progress (see the leaderboard [here](https://gluebenchmark.com/)). Microsoft has now created a similar benchmark called CodeXGLUE [(Lu et al., 2021)](https://arxiv.org/abs/2102.04664), but for code + natural language instead of just natural language. It consits of several subtasks (similar to GLUE).

Let's only load the examples of the Ruby programming language. This is a fairly small dataset, which is ideally suited for demonstration purposes in Google Colab. The Python split has way more training examples (250,000), but training this in Google Colab isn't ideal.

In [9]:
project2sources = {
    'spring-boot': ['spring-framework', 'dubbo', 'flink', 'kafka', 'spring-security', 'guava', 'ExoPlayer'],
    'spring-framework': ['spring-boot', 'dubbo', 'flink', 'spring-security', 'kafka', 'ExoPlayer', 'guava'],
    'spring-security': ['spring-framework', 'spring-boot', 'dubbo', 'kafka', 'flink', 'ExoPlayer' ,'guava'],
    'guava': ['flink', 'dubbo', 'spring-framework', 'kafka', 'ExoPlayer', 'spring-boot', 'dagger'],
    'ExoPlayer': ['flink', 'spring-framework', 'guava', 'kafka', 'spring-boot', 'dubbo', 'spring-security'],
    'kafka': ['flink', 'spring-boot', 'spring-framework', 'dubbo', 'guava', 'ExoPlayer', 'spring-security'],
    'dubbo': ['spring-framework', 'spring-boot', 'flink', 'kafka', 'guava', 'spring-security', 'dagger'],
    'flink': ['kafka', 'spring-framework', 'dubbo', 'spring-boot', 'guava', 'ExoPlayer', 'spring-security'],
}
testing_project='spring-boot'
training_projects=project2sources[testing_project][:3]
validating_project=project2sources[testing_project][3]
num_data_target=100

In [10]:
%cd /media/lhbac07/DeepCom

/media/lhbac07/DeepCom


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


In [11]:
import json
import pickle
train_dict=dict()
train_dict_unprocessed=dict()
for project in (training_projects+ [validating_project,testing_project]):
  with open(f'../dataset_v2/original/{project}/all_unpreprocess.code', 'rb') as handle:
    b = pickle.loads(handle.read())
    train_dict_unprocessed=train_dict_unprocessed|b
  with open(f'../dataset_v2/original/{project}/all_truncated_unlowered_dict.code', 'rb') as handle:
    b = pickle.loads(handle.read())
    train_dict=train_dict|b
for key in train_dict.keys():
  value=train_dict[key]
  train_dict[key]=train_dict_unprocessed[value.strip()]

In [12]:
from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup,
                          RobertaConfig, RobertaModel, RobertaTokenizer)
MODEL_CLASSES = {'roberta': (RobertaConfig, RobertaModel, RobertaTokenizer)}

config_class, model_class, tokenizer_class = MODEL_CLASSES['roberta']
config = config_class.from_pretrained('microsoft/codebert-base')
tokenizer = tokenizer_class.from_pretrained('microsoft/codebert-base',do_lower_case=True)

max_input_length = 313
max_target_length = 30

class InputFeatures(object):
    """A single training/test features for a example."""
    def __init__(self,
                 example_id,
                 source_ids,
                 target_ids,
                 source_mask,
                 target_mask,

    ):
        self.example_id = example_id
        self.source_ids = source_ids
        self.target_ids = target_ids
        self.source_mask = source_mask
        self.target_mask = target_mask

def preprocess_examples(examples):
  # encode the code-docstring pairs
  codes = examples['code']
  docstrings = examples['docstring']

  #print(docstrings)
  model_inputs = tokenizer(codes, max_length=max_input_length, padding="max_length", truncation=True)

  # encode the summaries
  labels = tokenizer(docstrings, max_length=max_target_length, padding="max_length", truncation=True).input_ids

  # important: we need to replace the index of the padding tokens by -100
  # such that they are not taken into account by the CrossEntropyLoss
  labels_with_ignore_index = []
  for labels_example in labels:
    labels_example = [label if label != 0 else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)

  model_inputs["labels"] = labels_with_ignore_index
  model_inputs["docstring"]=docstrings
  return model_inputs

def convert_examples_to_features(examples, tokenizer, args,stage=None):
    features = []
    for example_index, example in enumerate(examples):
        #source
        source_tokens = tokenizer.tokenize(example.code)[:max_input_length-2]
        source_tokens =[tokenizer.cls_token]+source_tokens+[tokenizer.sep_token]
        source_ids =  tokenizer.convert_tokens_to_ids(source_tokens) 
        source_mask = [1] * (len(source_tokens))
        padding_length = args.max_source_length - len(source_ids)
        source_ids+=[tokenizer.pad_token_id]*padding_length
        source_mask+=[0]*padding_length
 
        #target
        if stage=="test":
            target_tokens = tokenizer.tokenize("None")
        else:
            target_tokens = tokenizer.tokenize(example.docstring)[:max_input_length-2]
        target_tokens = [tokenizer.cls_token]+target_tokens+[tokenizer.sep_token]            
        target_ids = tokenizer.convert_tokens_to_ids(target_tokens)
        target_mask = [1] *len(target_ids)
        padding_length = args.max_target_length - len(target_ids)
        target_ids+=[tokenizer.pad_token_id]*padding_length
        target_mask+=[0]*padding_length   
        features.append(
            InputFeatures(
                 example_index,
                 source_ids,
                 target_ids,
                 source_mask,
                 target_mask,
            )
        )
    return features




vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

In [14]:
from torch.utils.data import Dataset
import torch
import numpy as np
def filter_data(codes, nls):
    """
    filter the data according to the rules
    :param codes: list of tokens of source codes
    :param asts: list of tokens of sequence asts
    :param nls: list of tokens of comments
    :return: filtered codes, asts and nls
    """
    assert len(codes) == len(nls)

    new_codes = []
    new_nls = []
    for i in range(len(codes)):
        code = codes[i]
        nl = nls[i]
        if len(code.split()) > 313 or len(nl.split()) > 30 or len(nl.split()) < 4:
            continue
        new_codes.append(code)
        new_nls.append(nl)
    return new_codes, new_nls
def get_code_from_dict(codes, dict_code):
    new_codes = []
    for i in range(len(codes)):
        code = codes[i]
        new_codes.append(dict_code[code.strip()])
    return new_codes
def get_dict_dataset(codes,nls):
  res={'code':[],'docstring':[]}
  for i in range(len(codes)):
    code=codes[i]
    nl=nls[i]
    res['code'].append(code)
    res['docstring'].append(nl.strip())
  return res
def convert_list(diction,has_doc):
  res=[]
  for i in range(len(diction['input_ids'])):
    if has_doc==False:
      res.append({'input_ids':torch.tensor(diction['input_ids'][i]),
            'attention_mask':torch.tensor(diction['attention_mask'][i]),
            'labels':torch.tensor(diction['labels'][i])})
    else:
      res.append({'input_ids':torch.tensor(diction['input_ids'][i]),
                'attention_mask':torch.tensor(diction['attention_mask'][i]),
                'labels':torch.tensor(diction['labels'][i]),
                'docstring':diction['docstring'][i]})
  return res
class CodePtrDataset(Dataset):

    def __init__(self, code_path, nl_path,dict_code,num_of_data=-1,seed=1,has_doc=False):
        # get lines
        with open(code_path, 'r', encoding='utf-8') as file:
          codes=file.readlines()
        with open(nl_path, 'r', encoding='utf-8') as file:
          nls=file.readlines()
        if num_of_data!=-1:
          np.random.seed(seed)
          sidx = np.random.permutation(len(codes))
          ele_pos=sidx[:num_of_data]
          codes=[codes[i] for i in ele_pos]
          nls=[nls[i] for i in ele_pos]

        if len(codes) != len(nls):
            raise Exception('The lengths of three dataset do not match.')
        self.has_doc=has_doc
        codes, nls = filter_data(codes, nls)

        self.codes=get_code_from_dict(codes,dict_code)
        dict_dataset=get_dict_dataset(self.codes,nls)
        self.preprocessed_dict=preprocess_examples(dict_dataset)
        #input_dicts=  [[1,2,3,4],[5,6,7,8],[9,10,11,12],...]
        self.list_dict=convert_list(self.preprocessed_dict,self.has_doc)
        #print(self.list_dict[:1])
        #print(self.list_dict)
    def __len__(self):
        return len(self.codes)

    def __getitem__(self, index):
        #return {'input_ids':self.preprocessed_dict['input_ids'][index], 'attention_mask':self.preprocessed_dict['attention_mask'][index]
        #        , 'labels': self.preprocessed_dict['labels'][index]}
        return self.list_dict[index]

# 'input_id: [[1,2,3,4],[5,6,7,8]]'
# 'input_id: [[1,5],[2,6],[3,7],[4,8]]

The goal for the model is to generate a docstring based on the provided code.

Let's now prepare the examples (i.e. code-docstring pairs) for the model. As you might know, Transformer models like BERT, BART, T5 etc. don't expect text as direct input, but rather integers which are called `input_ids` in HuggingFace Transformers. These represent tokens of a certain vocabulary. The model will learn rich contextual embedding vectors for each token, allowing it to get good results.

In other words, we need to turn the "Code" input from above into `input_ids`, and similarly, we need to turn the "Docstring" output from above into `input_ids`, which will serve as the `labels` for the model.

In addition, as these models are trained on batches of examples rather than one example at a time, we'll need to pad/truncate both the inputs and labels, such that they are all of the same length. That's why we also will add an `attention_mask` input to the model, such that it knows not to take into account padding tokens when computing attention scores.

To summarize:
* input: code, which is turned into `input_ids` + `attention_mask`
* output: docstrings, which are turned into `labels` (which are the `input_ids` of the docstrings).

Below, we define a `preprocess_examples` function, which we can apply on the entire dataset.

In [None]:
#dataset

Now that we have defined the function, let's call `.map()` on the HuggingFace Dataset object, which allows us to apply this function in batches (by default a batch size of 1,000 is used!) - hence super fast.

In [None]:
#dataset = dataset.map(preprocess_examples, batched=True)

  0%|          | 0/25 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
#dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'func_name', 'id', 'input_ids', 'labels', 'language', 'original_string', 'path', 'repo', 'sha', 'url'],
        num_rows: 24927
    })
    validation: Dataset({
        features: ['attention_mask', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'func_name', 'id', 'input_ids', 'labels', 'language', 'original_string', 'path', 'repo', 'sha', 'url'],
        num_rows: 1400
    })
    test: Dataset({
        features: ['attention_mask', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'func_name', 'id', 'input_ids', 'labels', 'language', 'original_string', 'path', 'repo', 'sha', 'url'],
        num_rows: 1261
    })
})

Next, let's set the format to "torch" and create PyTorch dataloaders.

In [84]:
from torch.utils.data import DataLoader
import torch
torch.manual_seed(1)
#dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'labels'])

#train_dataset_1=CodePtrDataset(f'../dataset_v2/original/{testing_project}/train_transfer.code'
#,f'../dataset_v2/original/{testing_project}/train_transfer.comment',train_dict)
train_dataset_2=[]
for project in training_projects:
  train_dataset_2.append(CodePtrDataset(f'../dataset_v2/original/{project}/all_truncated_final.code'
  ,f'../dataset_v2/original/{project}/all_truncated_final.comment',train_dict,num_of_data=100))
train_dataset=torch.utils.data.ConcatDataset(train_dataset_2)

#train_dataset=torch.utils.data.ConcatDataset([train_dataset_1, train_dataset_2])
valid_dataset=CodePtrDataset(f'../dataset_v2/original/{validating_project}/all_truncated_final.code'
,f'../dataset_v2/original/{validating_project}/all_truncated_final.comment',train_dict)

#test_dataset=CodePtrDataset(f'../dataset_v2/original/{testing_project}/valid.code'
#,f'../dataset_v2/original/{testing_project}/valid.comment',train_dict,num_of_data=100,has_doc=True)

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=32)
valid_dataloader = DataLoader(valid_dataset, batch_size=32)

#test_dataloader = DataLoader(test_dataset, batch_size=32)


In [None]:
#batch = next(iter(test_dataloader))
#print(batch)

{'input_ids': tensor([[    1,  3495, 21872,  5110,    30,  1071,   760, 16959,  3418,    32,
          5852,  1305,    34,  1240,  9858,  2016,    12,   203,  3639, 13913,
          1533,   329,  8634,    32,  5852,   797,    34,   618,  8634,    13,
           288,   203,   565,   327,  1240,  9858,   990,  2016,    12,   723,
          8634,    13,   203,  5411,   263,   464,    12, 21516,  9858,  1379,
          2016,    12,   723,  8634,  3719,   203,  5411,   263,   464,    12,
         21516,  9858,   503,  2016,    12,   723,  8634, 10019,   203,    97,
             2,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

[link text](https://)Let's verify an example, by decoding it back into text:

In [None]:
#tokenizer.decode(batch['input_ids'][2])

'<s>Summarize Java: public static DescribedPredicate<JavaField> arePublicStaticOfType(Class<?> clazz) {\n    return DescribedPredicate.describe(\n            "are public, static, and of type " + clazz.getSimpleName(),\n            field ->\n                    field.getModifiers().contains(JavaModifier.PUBLIC)\n                            && field.getModifiers().contains(JavaModifier.STATIC)\n                            && field.getRawType().isEquivalentTo(clazz));\n}</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad

In [None]:
#labels = batch['labels'][1]
#tokenizer.decode([label for label in labels if label != -100])

'<s>tests leaf return types of a method against the given predicate</s>'

## Fine-tune using PyTorch Lightning

As we will train the model using PyTorch Lightning, we first need to define a `LightningModule`, which is an `nn.Module` with some additional functionalities. We just need to define the `forward` pass, `training_step` (and optionally `validation_step` and `test_step`), and the corresponding dataloaders. PyTorch Lightning will then automate the training for us, handling device placement (i.e. we don't need to type `.to(device)` anywhere), etc. It also comes with support for loggers (such as Tensorboard, Weights and Biases) and callbacks.

Of course, you could also train the model in other ways:
* using regular PyTorch
* using the HuggingFace Trainer (in this case, the Seq2SeqTrainer)
* using HuggingFace Accelerate
* etc.

In [79]:
import torch
import torch.nn as nn
import torch
from torch.autograd import Variable
import copy
class Seq2Seq(nn.Module):
    """
        Build Seqence-to-Sequence.
        
        Parameters:

        * `encoder`- encoder of seq2seq model. e.g. roberta
        * `decoder`- decoder of seq2seq model. e.g. transformer
        * `config`- configuration of encoder model. 
        * `beam_size`- beam size for beam search. 
        * `max_length`- max length of target for beam search. 
        * `sos_id`- start of symbol ids in target for beam search.
        * `eos_id`- end of symbol ids in target for beam search. 
    """
    def __init__(self, encoder,decoder,config,beam_size=None,max_length=None,sos_id=None,eos_id=None):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder=decoder
        self.config=config
        self.register_buffer("bias", torch.tril(torch.ones(2048, 2048)))
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.lsm = nn.LogSoftmax(dim=-1)
        self.tie_weights()
        
        self.beam_size=beam_size
        self.max_length=max_length
        self.sos_id=sos_id
        self.eos_id=eos_id
        
    def _tie_or_clone_weights(self, first_module, second_module):
        """ Tie or clone module weights depending of whether we are using TorchScript or not
        """
        if self.config.torchscript:
            first_module.weight = nn.Parameter(second_module.weight.clone())
        else:
            first_module.weight = second_module.weight
                  
    def tie_weights(self):
        """ Make sure we are sharing the input and output embeddings.
            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
        """
        self._tie_or_clone_weights(self.lm_head,
                                   self.encoder.embeddings.word_embeddings)        
        
    def forward(self, source_ids=None,source_mask=None,target_ids=None,target_mask=None,args=None):   
        outputs = self.encoder(source_ids, attention_mask=source_mask)
        encoder_output = outputs[0].permute([1,0,2]).contiguous()
        if target_ids is not None:  
            attn_mask=-1e4 *(1-self.bias[:target_ids.shape[1],:target_ids.shape[1]])
            tgt_embeddings = self.encoder.embeddings(target_ids).permute([1,0,2]).contiguous()
            out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool())
            hidden_states = torch.tanh(self.dense(out)).permute([1,0,2]).contiguous()
            lm_logits = self.lm_head(hidden_states)
            # Shift so that tokens < n predict n
            active_loss = target_mask[..., 1:].ne(0).view(-1) == 1
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = target_ids[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1))[active_loss],
                            shift_labels.view(-1)[active_loss])

            outputs = loss,loss*active_loss.sum(),active_loss.sum()
            return outputs
        else:
            #Predict 
            preds=[]       
            zero=torch.cuda.LongTensor(1).fill_(0)     
            for i in range(source_ids.shape[0]):
                context=encoder_output[:,i:i+1]
                context_mask=source_mask[i:i+1,:]
                beam = Beam(self.beam_size,self.sos_id,self.eos_id)
                input_ids=beam.getCurrentState()
                context=context.repeat(1, self.beam_size,1)
                context_mask=context_mask.repeat(self.beam_size,1)
                for _ in range(self.max_length): 
                    if beam.done():
                        break
                    attn_mask=-1e4 *(1-self.bias[:input_ids.shape[1],:input_ids.shape[1]])
                    tgt_embeddings = self.encoder.embeddings(input_ids).permute([1,0,2]).contiguous()
                    out = self.decoder(tgt_embeddings,context,tgt_mask=attn_mask,memory_key_padding_mask=(1-context_mask).bool())
                    out = torch.tanh(self.dense(out))
                    hidden_states=out.permute([1,0,2]).contiguous()[:,-1,:]
                    out = self.lsm(self.lm_head(hidden_states)).data
                    beam.advance(out)
                    input_ids.data.copy_(input_ids.data.index_select(0, beam.getCurrentOrigin()))
                    input_ids=torch.cat((input_ids,beam.getCurrentState()),-1)
                hyp= beam.getHyp(beam.getFinal())
                pred=beam.buildTargetTokens(hyp)[:self.beam_size]
                pred=[torch.cat([x.view(-1) for x in p]+[zero]*(self.max_length-len(p))).view(1,-1) for p in pred]
                preds.append(torch.cat(pred,0).unsqueeze(0))
                
            preds=torch.cat(preds,0)                
            return preds   
        
        

class Beam(object):
    def __init__(self, size,sos,eos):
        self.size = size
        self.tt = torch.cuda
        # The score for each translation on the beam.
        self.scores = self.tt.FloatTensor(size).zero_()
        # The backpointers at each time-step.
        self.prevKs = []
        # The outputs at each time-step.
        self.nextYs = [self.tt.LongTensor(size)
                       .fill_(0)]
        self.nextYs[0][0] = sos
        # Has EOS topped the beam yet.
        self._eos = eos
        self.eosTop = False
        # Time and k pair for finished.
        self.finished = []

    def getCurrentState(self):
        "Get the outputs for the current timestep."
        batch = self.tt.LongTensor(self.nextYs[-1]).view(-1, 1)
        return batch

    def getCurrentOrigin(self):
        "Get the backpointers for the current timestep."
        return self.prevKs[-1]

    def advance(self, wordLk):
        """
        Given prob over words for every last beam `wordLk` and attention
        `attnOut`: Compute and update the beam search.

        Parameters:

        * `wordLk`- probs of advancing from the last step (K x words)
        * `attnOut`- attention at the last step

        Returns: True if beam search is complete.
        """
        numWords = wordLk.size(1)

        # Sum the previous scores.
        if len(self.prevKs) > 0:
            beamLk = wordLk + self.scores.unsqueeze(1).expand_as(wordLk)

            # Don't let EOS have children.
            for i in range(self.nextYs[-1].size(0)):
                if self.nextYs[-1][i] == self._eos:
                    beamLk[i] = -1e20
        else:
            beamLk = wordLk[0]
        flatBeamLk = beamLk.view(-1)
        bestScores, bestScoresId = flatBeamLk.topk(self.size, 0, True, True)

        self.scores = bestScores

        # bestScoresId is flattened beam x word array, so calculate which
        # word and beam each score came from
        prevK = bestScoresId // numWords
        self.prevKs.append(prevK)
        self.nextYs.append((bestScoresId - prevK * numWords))


        for i in range(self.nextYs[-1].size(0)):
            if self.nextYs[-1][i] == self._eos:
                s = self.scores[i]
                self.finished.append((s, len(self.nextYs) - 1, i))

        # End condition is when top-of-beam is EOS and no global score.
        if self.nextYs[-1][0] == self._eos:
            self.eosTop = True

    def done(self):
        return self.eosTop and len(self.finished) >=self.size

    def getFinal(self):
        if len(self.finished) == 0:
            self.finished.append((self.scores[0], len(self.nextYs) - 1, 0))
        self.finished.sort(key=lambda a: -a[0])
        if len(self.finished) != self.size:
            unfinished=[]
            for i in range(self.nextYs[-1].size(0)):
                if self.nextYs[-1][i] != self._eos:
                    s = self.scores[i]
                    unfinished.append((s, len(self.nextYs) - 1, i)) 
            unfinished.sort(key=lambda a: -a[0])
            self.finished+=unfinished[:self.size-len(self.finished)]
        return self.finished[:self.size]

    def getHyp(self, beam_res):
        """
        Walk back to construct the full hypothesis.
        """
        hyps=[]
        for _,timestep, k in beam_res:
            hyp = []
            for j in range(len(self.prevKs[:timestep]) - 1, -1, -1):
                hyp.append(self.nextYs[j+1][k])
                k = self.prevKs[j][k]
            hyps.append(hyp[::-1])
        return hyps
    
    def buildTargetTokens(self, preds):
        sentence=[]
        for pred in preds:
            tokens = []
            for tok in pred:
                if tok==self._eos:
                    break
                tokens.append(tok)
            sentence.append(tokens)
        return sentence

Next, we initialize the model.

In [85]:
encoder = model_class.from_pretrained('microsoft/codebert-base',config=config)    
decoder_layer = nn.TransformerDecoderLayer(d_model=config.hidden_size, nhead=config.num_attention_heads)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
model=Seq2Seq(encoder=encoder,decoder=decoder,config=config,
                beam_size=5,max_length=max_target_length,
                sos_id=tokenizer.cls_token_id,eos_id=tokenizer.sep_token_id)

We can now simply start training on Colab's GPU.

In [86]:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

#wandb_logger = WandbLogger(name='codet5-finetune-code-summarization-python-shuffle', project='CodeT5')
# for early stopping, see https://pytorch-lightning.readthedocs.io/en/1.0.0/early_stopping.html?highlight=early%20stopping
early_stop_callback = EarlyStopping(
    monitor='validation_loss',
    patience=2,
    strict=False,
    verbose=False,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='step')

trainer = Trainer(enable_checkpointing=False,accelerator='gpu',
                  default_root_dir="../PLBART-spring-boot-100-t/Checkpoints",
                  max_epochs=50,
                  callbacks=[early_stop_callback, lr_monitor])
trainer.fit(model)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/media/lhbac07/miniconda3/envs/khoaluan/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Once we're done training, we can also save the HuggingFace model as follows:

In [87]:
save_directory= "../PLBART-spring-boot-100-t" # save in the current working directory, you can change this of course
model.model.save_pretrained(save_directory)

Non-default generation parameters: {'forced_eos_token_id': 2}


This allows us to easily load the trained model again using the `from_pretrained()` method, as shown below.

## Inference

Now that we've trained a model, let's test it on some examples from the test set.

In [None]:
#from datasets import load_dataset

#dataset = load_dataset("code_x_glue_ct_code_to_text", "ruby")
#print(dataset['test'])

Reusing dataset code_x_glue_ct_code_to_text (/root/.cache/huggingface/datasets/code_x_glue_ct_code_to_text/ruby/0.0.0/f8b7e9d51f609a87e7ec7c7431706d4ee0b402e3398560410313d4acc67060a0)


Dataset({
    features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
    num_rows: 1261
})


In [88]:
import nltk

from rouge import Rouge
device = "cuda:0" if torch.cuda.is_available() else "cpu"

def sentence_bleu_score(reference, candidate) -> float:
    """
    calculate the sentence level bleu score, 4-gram with weights(0.25, 0.25, 0.25, 0.25)
    :param reference: tokens of reference sentence
    :param candidate: tokens of sentence generated by model
    :return: sentence level bleu score
    """
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction()
    return nltk.translate.bleu_score.sentence_bleu(references=[reference],
                                                   hypothesis=candidate,
                                                   smoothing_function=smoothing_function.method4)


def corpus_bleu_score(references, candidates) -> float:
    smoothing_function = nltk.translate.bleu_score.SmoothingFunction()
    return nltk.translate.bleu_score.corpus_bleu(list_of_references=[[reference] for reference in references],
                                                 hypotheses=[candidate for candidate in candidates],
                                                 smoothing_function=smoothing_function.method4)


def meteor_score(reference, candidate):
    """
    meteor score
    :param reference:
    :param candidate:
    :return:
    """
    return nltk.translate.meteor_score.single_meteor_score(reference,
                                                           candidate, alpha=0.85, beta=0.2, gamma=0.6)


def rouge(reference, candidate):
    rouge = Rouge(metrics=['rouge-l'], max_n=4)
    result=rouge.get_scores(' '.join(candidate), ' '.join(reference))
    return result['rouge-l']['f']

def measure(batch_size, references, candidates) -> (float, float):
    """
    measures the top sentence model generated
    :param batch_size:
    :param references: batch of references
    :param candidates: batch of sentences model generated
    :return: total sentence level bleu score, total meteor score
    """
    total_s_bleu = 0
    total_meteor = 0
    total_rouge=0

    for index_batch in range(batch_size):
        reference = references[index_batch]
        candidate = candidates[index_batch]

        # sentence level bleu score
        sentence_bleu = sentence_bleu_score(reference, candidate)
        total_s_bleu = total_s_bleu+sentence_bleu

        # meteor score
        meteor = meteor_score(reference, candidate)
        total_meteor = total_meteor+ meteor

        #rouge-L
        rouge_score=rouge(reference,candidate)
        total_rouge+=rouge_score
    return total_s_bleu, total_meteor,total_rouge

def test_one_batch(batch, batch_size,model):
  candidates = []
  tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", src_lang="java", tgt_lang="en_XX")
  #print(tokenizer.lang_code_to_id)
  out=model.generate(batch['input_ids'].to(device),max_length=30,num_beams=5, decoder_start_token_id=tokenizer.lang_code_to_id["__en_XX__"])
  result=tokenizer.batch_decode(out, skip_special_tokens=True)
  for res in result:
    candidates.append(res.split())

  # outputs: [T, B, H]
  # hidden: [1, B, H]

  # translate indices into words both for candidates
  nl_batch=[]

  for nl in batch['docstring']:
    nl_batch.append(nl.split())
  #candidates = self.translate_indices(batch_sentences)

  # measure
  s_blue_score, meteor_s,rouge_score = measure(batch_size, references=nl_batch, candidates=candidates)

  return nl_batch, candidates, s_blue_score, meteor_s,rouge_score

We can load our trained model as follows:

In [None]:
from transformers import PLBartForConditionalGeneration
device = "cuda:0" if torch.cuda.is_available() else "cpu"

#model = T5ForConditionalGeneration.from_pretrained("../CodeT5-flink").to(device)
model=PLBartForConditionalGeneration.from_pretrained("../PLBART-flink-low-100").to(device)

In [27]:
import gc

torch.cuda.empty_cache()

In [83]:
from transformers import PLBartForConditionalGeneration
import config
import gc
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
import utils
device = "cuda:0" if torch.cuda.is_available() else "cpu"
total_res={}
for num_data in [100]:
    print("Num data: ",num_data)
    config.logger.info(f'Num data: {num_data}')
    res_dict=None
    for num_fold in range(5):
        res_dict=None
        for i in range(5):
            train_dataset=CodePtrDataset(f'../dataset_v2/original/{testing_project}/fold_{i}_train.code'
            ,f'../dataset_v2/original/{testing_project}/fold_{i}_train.comment',train_dict,num_of_data=num_data,seed=i)
            train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=32)
            model=CodeT5(path='../PLBART-spring-boot-100-t')
            early_stop_callback = EarlyStopping(
                monitor='validation_loss',
                patience=1,
                strict=False,
                verbose=False,
                mode='min'
            )
            lr_monitor = LearningRateMonitor(logging_interval='step')
            trainer = Trainer(enable_checkpointing=False,accelerator='gpu',
                              default_root_dir="../PLBART-spring-boot/Checkpoints",
                              max_epochs=1,limit_val_batches=0,
                              callbacks=[early_stop_callback, lr_monitor])
            trainer.fit(model)
            model=model.model.to(device)
            test_dataset=CodePtrDataset(f'../dataset_v2/original/{testing_project}/fold_{i}_test.code'
            ,f'../dataset_v2/original/{testing_project}/fold_{i}_test.comment',train_dict,has_doc=True)
            test_dataloader=DataLoader(test_dataset, batch_size=32)
            total_references = []
            total_candidates = []
            total_s_bleu = 0
            total_meteor = 0
            total_rouge=0
            sample_id = 0
            for index_batch, batch in enumerate(test_dataloader):
              batch_size = len(batch['input_ids'])
              references, candidates, s_blue_score, meteor_s,rouge_score = test_one_batch(batch, batch_size,model)
              total_s_bleu += s_blue_score
              total_meteor += meteor_s
              total_rouge +=rouge_score
              total_references += references
              total_candidates += candidates
            c_bleu = corpus_bleu_score(references=total_references, candidates=total_candidates)

            avg_s_bleu = total_s_bleu / len(test_dataset)
            avg_meteor = total_meteor / len(test_dataset)
            avg_rouge=total_rouge/len(test_dataset)
            result = {
              'c_bleu': c_bleu,
              's_bleu': avg_s_bleu,
              'meteor': avg_meteor,
              'rouge_L': avg_rouge
            }
            if res_dict==None:
                res_dict=result
            else:
                for key in res_dict.keys():
                    res_dict[key]=res_dict[key]+result[key]
            del train_dataloader,test_dataloader,train_dataset,test_dataset
            gc.collect()
            torch.cuda.empty_cache()
        for key in res_dict.keys():
            res_dict[key]=res_dict[key]/3
        if num_data not in total_res:
            total_res[num_data]=res_dict
        else:
            for key in total_res[num_data].keys():
                total_res[num_data][key]=total_res[num_data][key]+res_dict[key]
    for key in total_res[num_data].keys():
        total_res[num_data][key]=total_res[num_data][key]/5
    utils.print_test_scores(total_res[num_data],is_average=True)

for num_data in [100]:
    print(f'Num data: {num_data}')
    config.logger.info(f'Num data: {num_data}')
    utils.print_test_scores(total_res[num_data],is_average=True)

Num data:  100


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.


c_bleu: 0.12782668456865148. s_bleu: 0.12909522527649214. meteor: 0.26134107302150483. rouge_L: 0.562847095510951. 
Num data: 100
c_bleu: 0.12782668456865148. s_bleu: 0.12909522527649214. meteor: 0.26134107302150483. rouge_L: 0.562847095510951. 


In [None]:
from transformers import PLBartForConditionalGeneration
import config
import gc
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
import utils
device = "cuda:0" if torch.cuda.is_available() else "cpu"
total_res={}
for num_data in [2000]:
    print("Num data: ",num_data)
    config.logger.info(f'Num data: {num_data}')
    res_dict=None
    for num_fold in range(5):
        res_dict=None
        for i in range(2):
            train_dataset=CodePtrDataset(f'../dataset_v2/original/{testing_project}/fold_{i}_train.code'
            ,f'../dataset_v2/original/{testing_project}/fold_{i}_train.comment',train_dict,num_of_data=num_data,seed=i)
            train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=32)
            model=CodeT5(path='../PLBART-spring-framework-100')
            early_stop_callback = EarlyStopping(
                monitor='validation_loss',
                patience=1,
                strict=False,
                verbose=False,
                mode='min'
            )
            lr_monitor = LearningRateMonitor(logging_interval='step')
            trainer = Trainer(enable_checkpointing=False,accelerator='gpu',
                              default_root_dir="../PLBART-spring-boot/Checkpoints",
                              max_epochs=1,
                              callbacks=[early_stop_callback, lr_monitor])
            trainer.fit(model)
            model=model.model.to(device)
            test_dataset=CodePtrDataset(f'../dataset_v2/original/{testing_project}/fold_{i}_test.code'
            ,f'../dataset_v2/original/{testing_project}/fold_{i}_test.comment',train_dict,has_doc=True)
            test_dataloader=DataLoader(test_dataset, batch_size=32)
            total_references = []
            total_candidates = []
            total_s_bleu = 0
            total_meteor = 0
            total_rouge=0
            sample_id = 0
            for index_batch, batch in enumerate(test_dataloader):
              batch_size = len(batch['input_ids'])
              references, candidates, s_blue_score, meteor_s,rouge_score = test_one_batch(batch, batch_size,model)
              total_s_bleu += s_blue_score
              total_meteor += meteor_s
              total_rouge +=rouge_score
              total_references += references
              total_candidates += candidates
            c_bleu = corpus_bleu_score(references=total_references, candidates=total_candidates)

            avg_s_bleu = total_s_bleu / len(test_dataset)
            avg_meteor = total_meteor / len(test_dataset)
            avg_rouge=total_rouge/len(test_dataset)
            result = {
              'c_bleu': c_bleu,
              's_bleu': avg_s_bleu,
              'meteor': avg_meteor,
              'rouge_L': avg_rouge
            }
            if res_dict==None:
                res_dict=result
            else:
                for key in res_dict.keys():
                    res_dict[key]=res_dict[key]+result[key]
            del train_dataloader,test_dataloader,train_dataset,test_dataset
            gc.collect()
            torch.cuda.empty_cache()
        for key in res_dict.keys():
            res_dict[key]=res_dict[key]/3
        if num_data not in total_res:
            total_res[num_data]=res_dict
        else:
            for key in total_res[num_data].keys():
                total_res[num_data][key]=total_res[num_data][key]+res_dict[key]
    for key in total_res[num_data].keys():
        total_res[num_data][key]=total_res[num_data][key]/5
    utils.print_test_scores(total_res[num_data],is_average=True)

for num_data in [2000]:
    print(f'Num data: {num_data}')
    config.logger.info(f'Num data: {num_data}')
    utils.print_test_scores(total_res[num_data],is_average=True)

INFO:root:Configurations this run are shown below.
INFO:root:Notes: If only runs test, the model configurations shown above is not the configurations of the model test runs on.
INFO:root:
INFO:root:Features and limitations:
INFO:root:dataset_dir: ../dataset_v2
INFO:root:use_cuda: True
INFO:root:device: cuda
INFO:root:use_coverage: False
INFO:root:use_pointer_gen: False
INFO:root:use_teacher_forcing: True
INFO:root:use_lr_decay: True
INFO:root:use_early_stopping: True
INFO:root:max_code_length: 313
INFO:root:max_nl_length: 30
INFO:root:min_nl_length: 4
INFO:root:max_decode_steps: 30
INFO:root:early_stopping_patience: 5
INFO:root:
INFO:root:Train configurations:
INFO:root:embedding_dim: 256
INFO:root:hidden_size: 256
INFO:root:decoder_dropout_rate: 0.5
INFO:root:teacher_forcing_ratio: 0.5
INFO:root:batch_size: 32
INFO:root:code_encoder_lr: 0.001
INFO:root:ast_encoder_lr: 0.001
INFO:root:reduce_hidden_lr: 0.001
INFO:root:decoder_lr: 0.0001
INFO:root:lr_decay_every: 1
INFO:root:lr_decay_ra

Num data:  2000


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                           | Params | Mode
----------------------------------------------------------------
0 | model | PLBartForConditionalGeneration | 139 M  | eval
----------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
556.883   Total estimated model params size (MB)
0         Modules in train mode
182       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
INFO:root:c_bleu: 0.10605370296461034.
INFO:root:s_bleu: 0.10445902468105839.
INFO:root:meteor: 0.14390213190950013.
INFO:root:rouge_L: 0.2978738895249798.
INFO:root:Num data: 2000
INFO:root:c_bleu: 0.10605370296461034.
INFO:root:s_bleu: 0.10445902468105839.
INFO:root:meteor: 0.14390213190950013.
INFO:root:rouge_L: 0.2978738895249798.


c_bleu: 0.10605370296461034. s_bleu: 0.10445902468105839. meteor: 0.14390213190950013. rouge_L: 0.2978738895249798. 
Num data: 2000
c_bleu: 0.10605370296461034. s_bleu: 0.10445902468105839. meteor: 0.14390213190950013. rouge_L: 0.2978738895249798. 


In [18]:
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids
print(input_ids)
print(tokenizer.cls_token)

tensor([[    0,  9232, 17395,  1640, 12105,  3256,  5780,  1640,   506,   108,
         42891, 28696, 30842,  1215,   808,  1215,   288, 15698,   328, 27645,
             2]])
<s>


We can prepare the example using `RobertaTokenizer`, and generate using the `.generate()` method. Note that there are several ways of doing generation (greedy decoding/beam search/top k sampling/etc.), for that I refer to Patrick's blog post which you can find [here](https://huggingface.co/blog/how-to-generate). Here we will just use the default settings (i.e. greedy decoding).

In [None]:
# prepare for the model
input_ids = tokenizer(test_example['code'], return_tensors='pt').input_ids
# generate
outputs = model.generate(input_ids,num_beams=5 )
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))

NameError: name 'test_example' is not defined

Let's compare this to the ground-truth docstring:

In [None]:
print("Ground truth:", test_example['docstring'])

Ground truth: make sure to never prune the ejson-keys secret


## Upload trained model to the hub

Cool! We can also share our model with the world, by uploading it to [hf.co](https://hf.co). For that, we need to install Git-LFS, which is used for using git with large files (note that each model on the hub = a git repository!).

In [None]:
!sudo apt-get install git-lfs
!git config --global user.email "niels.rogge1@gmail.com"
!git config --global user.name "Niels Rogge"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 40 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 1s (1,837 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 148492 files and directories c

Next, we can login with the credentials of our HuggingFace account (you can sign up on [hf.co](https://hf.co) if you haven't already!).

In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: nielsr
Password: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper store[0m


In [None]:
repo_url = "https://huggingface.co/nielsr/codet5-small-code-summarization-ruby"

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir="checkpoint", # note that this directory must not exist already
                  clone_from=repo_url,
                  git_user="Niels Rogge",
                  git_email="niels.rogge1@gmail.com",
                  use_auth_token=True,
)

Cloning https://huggingface.co/nielsr/codet5-small-code-summarization-ruby into local empty directory.


In [None]:
model.save_pretrained("/content/checkpoint")
tokenizer.save_pretrained("/content/checkpoint")

In [None]:
# push to hub
repo.push_to_hub(commit_message="First commit")

Upload file pytorch_model.bin:   0%|          | 3.43k/231M [00:00<?, ?B/s]

'https://huggingface.co/nielsr/codet5-small-code-summarization-ruby/commit/338c3a3b3f8d19dd32d2e881948a2236f09945e9'