## Introduction

In recent years, large pre-trained language models have shown strong potential in several tasks, such as question answering, sentiment analysis, and abstractive summarization. Some of the important models include [ELMo (Peters et al., 2018)](https://arxiv.org/abs/1802.05365), [GPT (Radford et al., 2018)](https://arxiv.org/abs/2005.14165), [BERT (Devlin et al., 2018)](https://arxiv.org/pdf/1810.04805.pdf), [XLNet (Yang et al., 2019)](https://arxiv.org/pdf/1906.08237.pdf), and [RoBERTa (Liu et al., 2019)](https://arxiv.org/pdf/1907.11692.pdf).

With the exception of ELMo, which is built on bidirectional LSTMs, they follow the base architecture proposed by Vaswani et al. in their seminal paper, [Attention is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762).

Following the naturalness hypothesis of source code, proposed by Allamanis et al., it is reasonable to treat large code corpora in a similar fashion and exploit their statistical properties.

In this Colab, we walk through an end-to-end pipeline that applies similar technology to a very challenging problem called code summarization, using the [Hugging Face Transformers library](https://huggingface.co/transformers/), [Microsoft's open-source large-scale code model CodeBERT](https://huggingface.co/microsoft/codebert-base), and [Codist tree-hugger](https://github.com/autosoft-dev/tree-hugger).

## Background

![transformer](https://lilianweng.github.io/lil-log/assets/images/transformer.png)

<h3 align="center">The full model architecture of the transformer. (Image source: Fig 1 & 2 in Vaswani, et al., 2017.)</h3>


We won't go into the details of the transformer architecture, because that is outside the scope of this tutorial and there are plenty of very good resources that do justice to it. Please check out [here](https://jalammar.github.io/illustrated-transformer/), [here](http://nlp.seas.harvard.edu/2018/04/03/attention.html), and [here](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html).

For this tutorial we use Microsoft CodeBERT as the baseline model. It was proposed in a 2020 [paper](https://arxiv.org/pdf/2002.08155.pdf) by Zhangyin Feng et al. and is [freely available](https://huggingface.co/microsoft/codebert-base) via the Hugging Face model repository.

We at [Codist](https://codist-ai.com/) have released a new model and a tool called `docly` that does exactly the same work (and a lot more!). We found our model slightly outperforming MS CodeBERT in some cases. If you are interested in it, please go to the website and sign up for the beta.

MS CodeBERT was trained with a hybrid objective function in which the model simultaneously predicts masked tokens (masked language modeling, MLM) and detects replaced tokens (replaced token detection, RTD). The final training objective can be expressed as

$$
\min_\theta \; L_{MLM}(\theta) + L_{RTD}(\theta)
$$
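
To make the two terms of the objective concrete, here is a toy calculation in plain Python. It is only a sketch under assumptions: the MLM term is written as the usual negative log-likelihood at masked positions, the RTD term as a standard binary cross-entropy over all positions, and every probability below is made up for illustration. The function names are hypothetical and the exact formulation in the CodeBERT paper may differ in detail.

```python
import math


def mlm_loss(masked_token_probs):
    """Negative log-likelihood of the correct token at each masked position."""
    return -sum(math.log(p) for p in masked_token_probs)


def rtd_loss(original_probs, is_original):
    """Binary cross-entropy over every position: was the token kept or replaced?"""
    total = 0.0
    for p, kept in zip(original_probs, is_original):
        # p is the predicted probability that the token is the original one
        total -= math.log(p) if kept else math.log(1.0 - p)
    return total


def hybrid_loss(masked_token_probs, original_probs, is_original):
    """Sum of the two terms, mirroring L_MLM + L_RTD."""
    return mlm_loss(masked_token_probs) + rtd_loss(original_probs, is_original)


# Toy numbers, invented for this example only.
masked_token_probs = [0.5, 0.25]     # p(correct token) at two masked positions
original_probs = [0.9, 0.2, 0.8]     # p(token is original) at three positions
is_original = [True, False, True]    # ground truth: the second token was replaced

loss = hybrid_loss(masked_token_probs, original_probs, is_original)
print(round(loss, 4))
```

Both terms shrink toward zero as the model gets more confident about the right answer, so minimizing their sum pushes the model to do well on both tasks at once.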


## Let's start coding

We first download two companion files that contain some useful functions and the main model architecture code.

In [None]:
!wget https://raw.githubusercontent.com/autosoft-dev/ml-on-code/main/assets/model.py
!wget https://raw.githubusercontent.com/autosoft-dev/ml-on-code/main/assets/utils.py

### Let's install tree-hugger and transformers

In [None]:
!pip install transformers
!pip install -U tree-hugger PyYAML

### And use this command to build the necessary processing library (tree-hugger related)

In [4]:
!create_libs -c python

2020-11-25 19:14:22,466 INFO:Cloneing python repo from tree-sitter collections
2020-11-25 19:14:34,160 INFO:Creating the library my-languages.so at /content
2020-11-25 19:14:35,072 INFO:Finished creating library!


**Let's import all necessary modules**

In [5]:
import os
import json
import torch
import torch.nn as nn
from model import Seq2Seq
from utils import Example, convert_examples_to_features
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

Now that we have everything, let's download the fine-tuned model. Codist has fine-tuned this model for you to test with 😃

In [6]:
!wget https://code-summary.s3.amazonaws.com/pytorch_model.bin

--2020-11-25 19:17:03--  https://code-summary.s3.amazonaws.com/pytorch_model.bin
Resolving code-summary.s3.amazonaws.com (code-summary.s3.amazonaws.com)... 52.217.41.172
Connecting to code-summary.s3.amazonaws.com (code-summary.s3.amazonaws.com)|52.217.41.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 706871064 (674M) [application/macbinary]
Saving to: ‘pytorch_model.bin’


2020-11-25 19:17:13 (65.9 MB/s) - ‘pytorch_model.bin’ saved [706871064/706871064]



In [17]:
# We define all the functions we need here.
def inference(data, model, tokenizer):
    # Run the model over the whole dataset in a single batch and decode the predictions
    eval_sampler = SequentialSampler(data)
    eval_dataloader = DataLoader(data, sampler=eval_sampler, batch_size=len(data))

    model.eval()
    p = []
    for batch in eval_dataloader:
        batch = tuple(t.to('cpu') for t in batch)
        source_ids, source_mask = batch
        with torch.no_grad():
            preds = model(source_ids=source_ids, source_mask=source_mask)
            for pred in preds:
                t = pred[0].cpu().numpy()
                t = list(t)
                if 0 in t:
                    t = t[: t.index(0)]
                text = tokenizer.decode(t, clean_up_tokenization_spaces=False)
                p.append(text)
    return (p, source_ids.shape[-1])


def get_features(examples, tokenizer):
    features = convert_examples_to_features(
        examples, tokenizer, stage="test"
    )
    all_source_ids = torch.tensor(
        [f.source_ids[: 256] for f in features], dtype=torch.long
    )
    all_source_mask = torch.tensor(
        [f.source_mask[: 256] for f in features], dtype=torch.long
    )
    return TensorDataset(all_source_ids, all_source_mask)


def build_model(model_class, config, tokenizer):
    encoder = model_class(config=config)
    decoder_layer = nn.TransformerDecoderLayer(
        d_model=config.hidden_size, nhead=config.num_attention_heads
    )
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
    model = Seq2Seq(
        encoder=encoder,
        decoder=decoder,
        config=config,
        beam_size=10,
        max_length=128,
        sos_id=tokenizer.cls_token_id,
        eos_id=tokenizer.sep_token_id,
    )

    model.load_state_dict(
        torch.load(
            "pytorch_model.bin",
            map_location=torch.device("cpu"),
        ),
        strict=False,
    )
    return model
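
Two small list operations in the cell above are easy to miss: `get_features` clips every input to its first 256 token ids, and `inference` cuts each predicted id sequence at the first `0` before decoding. The sketch below isolates both steps on plain Python lists. The helper names and the sample ids are hypothetical and chosen only for illustration; they are not part of `utils.py` or `model.py`.

```python
def truncate_ids(token_ids, max_len=256):
    """Keep at most max_len ids, as get_features does with f.source_ids[:256]."""
    return token_ids[:max_len]


def cut_at_first_zero(token_ids):
    """Drop everything from the first 0 onward, as inference does before decoding."""
    ids = list(token_ids)
    if 0 in ids:
        ids = ids[: ids.index(0)]
    return ids


# A made-up input that is longer than the limit
source = list(range(1, 301))
clipped = truncate_ids(source)

# A made-up prediction followed by trailing zeros
predicted = [713, 16, 10, 0, 0, 0]
kept = cut_at_first_zero(predicted)

print(len(clipped), kept)
```

Anything past the 256th token of a long function is therefore never seen by the model, which is worth keeping in mind when a summary misses something that happens late in the function body.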

Now that we have all the needed functions, let's load the baseline model from the Hugging Face model hub.

In [8]:
config = RobertaConfig.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained(
    "microsoft/codebert-base", do_lower_case=False
)

model = build_model(
    model_class=RobertaModel, config=config, tokenizer=tokenizer
).to('cpu')

Everything is ready to make predictions!! Let's do it.

In [9]:
example = [Example(source="def add_tensors(t, t1) -> Any:\n    return t + t1", target=None)]
message, length = inference(get_features(example, tokenizer), model, tokenizer)
print(message)

['Add two tensors .']


**AMAZING!!**

We need to be able to run it on a bunch of files, extract the functions from them, and then predict their docstrings. How shall we do it?


Codist [tree-hugger](https://github.com/autosoft-dev/tree-hugger) to the rescue!

For ease of the tutorial we have created a small GitHub example repo with a collection of files. Some of them come from open-source repos and some we created as example files.

Let's clone that

In [10]:
!git clone https://github.com/autosoft-dev/example-files.git

Cloning into 'example-files'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 16 (delta 2), reused 11 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), done.


We are going to declare a small function that helps us walk over every file in a nested directory tree (like the one we cloned above), yielding one path at a time.

In [11]:
from pathlib import Path

def check_out_path(target_path: Path):
    """
    This function recursively yields all contents of a pathlib.Path object
    """
    yield target_path
    for file in target_path.iterdir():
        if file.is_dir():
            yield from check_out_path(file)
        else:
            yield file.absolute()


def is_python_file(file_path: Path):
  """
  This little function will help us to filter the result and keep only the python files
  """
  return file_path.is_file() and file_path.suffix == ".py"

In [12]:
for file_path in check_out_path(Path("example-files")):
  if is_python_file(file_path):
    print(file_path)

/content/example-files/simple_funcs/simple_funcs.py
/content/example-files/inner_dir/_internal_utils.py
/content/example-files/flask_files/cli.py
/content/example-files/api.py
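
As an aside, the same listing can be produced with the standard library's `Path.rglob`, which walks the tree recursively on its own. The sketch below is an alternative to `check_out_path` plus `is_python_file`, not a replacement the rest of the notebook depends on. The function name `iter_python_files` and the throwaway directory layout are made up for the demo.

```python
import tempfile
from pathlib import Path


def iter_python_files(root: Path):
    """Yield every regular *.py file below root, in sorted order."""
    for path in sorted(root.rglob("*.py")):
        if path.is_file():
            yield path


# Self-contained demo on a temporary directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inner_dir").mkdir()
    (root / "api.py").write_text("def f():\n    return 1\n")
    (root / "inner_dir" / "utils.py").write_text("x = 1\n")
    (root / "README.md").write_text("not python\n")
    found = [p.relative_to(root).as_posix() for p in iter_python_files(root)]

print(found)
```

Sorting the paths makes the order reproducible across runs, which a plain `iterdir` walk does not guarantee.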


We are now ready to use tree-hugger to parse all the needed files. Let's do that.

In [13]:
# We first create our PythonParser object
from tree_hugger.core import PythonParser

In [14]:
pp = PythonParser(library_loc="/content/my-languages.so")
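
The parsing step that follows relies on getting, for each function in a file, its source text and its docstring. To show what that kind of mapping looks like without building the tree-sitter library, here is a rough stand-in based on Python's built-in `ast` module. This is not how tree-hugger works internally, and the helper `function_bodies` is a hypothetical name. Unlike the `strip_docstr=True` option used in this notebook, the sketch leaves the docstring inside the returned source text.

```python
import ast


def function_bodies(source: str):
    """Map each function name to (source_text, docstring_or_None)."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result[node.name] = (
                ast.get_source_segment(source, node),  # exact text of the def
                ast.get_docstring(node),               # None when there is no docstring
            )
    return result


# A tiny made-up module to parse
sample = '''
def add(a, b):
    """Add two numbers."""
    return a + b


def open_file(path):
    return open(path)
'''

funcs = function_bodies(sample)
for name, (body, docstring) in funcs.items():
    print(name, "->", docstring)
```

The `ast` route only handles Python and needs syntactically valid files, which is one reason a tree-sitter based parser is the more general choice for a multi-language pipeline.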

In [16]:
# Let's use the function we defined before to go over all the files.
for file_path in check_out_path(Path("example-files")):
  if is_python_file(file_path):
    # We use a one-line, super convenient tree-hugger API call to get the needed data
    if pp.parse_file(str(file_path)):
      # The following call returns a dict where each key is the name of a function
      # and each value is a tuple, (function_body, function_docstring)
      func_and_docstr = pp.get_all_function_bodies(strip_docstr=True)
      for func_name, (body, docstr) in func_and_docstr.items():
        example = [Example(source=body, target=None)]
        message, length = inference(get_features(example, tokenizer), model, tokenizer)
        # Print the function name next to its predicted summary
        print(func_name, " ".join(message))

add Add two vectors .
check_even_numbers_in_a_list Checks that all numbers in a list are equal .
open_file open a file
add_tensors Add two tensors .
to_native_string Convert string to native string .
unicode_is_ascii Check if unicode is ASCII .
find_best_app Find the best Flask application in a module .
call_factory Call app factory .
_called_with_wrong_args Check if a function has wrong arguments .
find_app_by_string Find application by name or function name .
prepare_import Prepare a python import .
locate_app Locate a Flask application .
get_version Print current version
_load_app Load the lock .
with_appcontext A decorator that adds a click context to a function .
decorator Wrapper for click . Command .
_path_is_ancestor Check if path is an ancestor of other .
load_dotenv Load . env file .
show_server_banner Show the server banner .
_validate_key Ensure key is valid .
run_command Run a werkzeug command .
shell_command Create a shell command .
routes_command Executor for globus rout

With this code, you can very easily create a dataset out of your own code files and then test the baseline models against it. 
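
One simple way to turn such results into a dataset is to store one JSON record per function in a JSON Lines file. The sketch below is a minimal illustration using only the standard library. The record fields (`function`, `code`, `summary`) and the file name `summaries.jsonl` are assumptions chosen for this example, and the single record reuses the `add_tensors` prediction shown earlier in this notebook.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, out_path: Path):
    """Write one JSON object per line."""
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path: Path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


records = [
    {
        "function": "add_tensors",
        "code": "def add_tensors(t, t1) -> Any:\n    return t + t1",
        "summary": "Add two tensors .",
    },
]

# Round-trip through a temporary directory so the demo leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    path = write_jsonl(records, Path(tmp) / "summaries.jsonl")
    loaded = read_jsonl(path)

print(loaded[0]["function"], "->", loaded[0]["summary"])
```

In the notebook itself, the records would be appended inside the parsing loop instead of being written by hand, with a reference docstring added as an extra field whenever the function has one.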


That was easy!

As said earlier, Codist just released `docly`, where we use similar parsing and modelling methods to generate docstrings (with arguments and several other things that are not present in this baseline model). If you are interested, please have a look [here](https://codist-ai.com/).