## Introduction

Recently, large pre-trained language models have shown great potential in several tasks, such as question answering, sentiment analysis, and abstractive summarization. Some of the important models include [ELMo (Peters et al., 2018)](https://arxiv.org/abs/1802.05365), [GPT (Radford et al., 2018)](https://arxiv.org/abs/2005.14165), [BERT (Devlin et al., 2018)](https://arxiv.org/pdf/1810.04805.pdf), [XLNet (Yang et al., 2019)](https://arxiv.org/pdf/1906.08237.pdf), and [RoBERTa (Liu et al., 2019)](https://arxiv.org/pdf/1907.11692.pdf).

With the exception of ELMo, which is LSTM-based, they all build on the Transformer architecture proposed by Vaswani et al. in their seminal paper, [Attention is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762).

Following the naturalness hypothesis of source code, proposed by Allamanis et al., it makes sense to treat large code corpora in a similar fashion and exploit their statistical properties.

In this Colab, we walk through an end-to-end pipeline that uses the [Hugging Face Transformers library](https://huggingface.co/transformers/), [Microsoft's open-source large-scale code model CodeBERT](https://huggingface.co/microsoft/codebert-base), and [Codist tree-hugger](https://github.com/autosoft-dev/tree-hugger) to apply similar technology to a very challenging problem: code summarization.

Microsoft Research Asia, working together with the Developer Division and Bing, has introduced [CodeXGLUE](https://github.com/microsoft/CodeXGLUE), a **benchmark dataset and open challenge for code intelligence**.

It includes 14 datasets ([CodeSearchNet](https://github.com/github/CodeSearchNet), [Py150](https://eth-sri.github.io/py150)...) for 10 diversified code intelligence tasks. Those datasets are all created from Open Source repos. CodeXGLUE also includes baseline model implementations.

CodeXGLUE is for code what ImageNet is for Computer Vision or GLUE for NLP. 




**🤔 BUT**

What if you want to **add your own dataset to these pre-built ones** or **test the baseline models on your code**?



## tree-hugger: code pre-processing library

At Codist, we recently open sourced our **code processing library**, [tree-hugger](https://github.com/autosoft-dev/tree-hugger). In this tutorial we will show you how to:
* install and set up the tree-hugger code processing library
* create your own dataset, similar to the open-source datasets supplied by CodeXGLUE



🏆 You can then **test the baseline model** on your own data and see how it performs

### Let's install tree-hugger

In [None]:
!pip install -U tree-hugger PyYAML

### And use this command to build the necessary processing library

In [None]:
!create_libs -c python

2020-10-10 10:53:52,071 INFO:Cloneing python repo from tree-sitter collections
2020-10-10 10:54:02,856 INFO:Creating the library my-languages.so at /content
2020-10-10 10:54:03,668 INFO:Finished creating library!


Now that the set-up is done, let's download some files. To keep the tutorial simple, we have created a small example repo on GitHub with a collection of files. Some of them come from open-source repos and some we wrote as examples.

Let's clone it:

In [None]:
!git clone https://github.com/autosoft-dev/example-files.git

Cloning into 'example-files'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 16 (delta 2), reused 11 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), done.


We are going to declare a small function that walks a nested directory tree (like the one we just cloned) and yields its contents one path at a time. A second helper then keeps only the Python files.

In [None]:
from pathlib import Path

def check_out_path(target_path: Path):
    """
    This function recursively yields all contents of a pathlib.Path object
    """
    yield target_path
    for file in target_path.iterdir():
        if file.is_dir():
            yield from check_out_path(file)
        else:
            yield file.absolute()


def is_python_file(file_path: Path):
  """
  This little function will help us to filter the result and keep only the python files
  """
  return file_path.is_file() and file_path.suffix == ".py"

In [None]:
for file_path in check_out_path(Path("example-files")):
  if is_python_file(file_path):
    print(file_path)

/content/example-files/simple_funcs/simple_funcs.py
/content/example-files/flask_files/cli.py
/content/example-files/api.py
/content/example-files/inner_dir/_internal_utils.py
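The same traversal can also be sketched with the standard library's `Path.rglob`, which does the recursion for us. This is only an alternative illustration, not part of tree-hugger: the helper name `iter_python_files` and the throwaway directory layout below are made up for the demo.

```python
import tempfile
from pathlib import Path


def iter_python_files(root: Path):
    """Yield the absolute path of every .py file below root, in sorted order.

    Hypothetical helper: it combines check_out_path and is_python_file
    from above into a single generator built on Path.rglob.
    """
    for path in sorted(root.rglob("*.py")):
        if path.is_file():
            yield path.absolute()


# Demo on a temporary directory so the cell runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inner_dir").mkdir()
    (root / "api.py").write_text("x = 1\n")
    (root / "inner_dir" / "utils.py").write_text("y = 2\n")
    (root / "README.md").write_text("not python\n")

    found = [p.name for p in iter_python_files(root)]
    print(found)
```

Sorting the paths makes the order of the resulting dataset reproducible, which `iterdir()` alone does not guarantee.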


And now we will define another small function that, given a string of Python code, tokenizes it.

In [None]:
from tokenize import tokenize
from io import BytesIO


def tokenize_code_string(text):
    code_tokens = []
    for tok in tokenize(BytesIO(text.encode('utf-8')).readline):
        if tok.string.strip() != "" and tok.string.strip() != "utf-8":
            code_tokens.append(tok.string.strip().lower())
    return code_tokens
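To see what this function produces, here is a small self-contained run on a made-up snippet. The function is repeated so the cell works on its own; the `sample` string is purely illustrative.

```python
from io import BytesIO
from tokenize import tokenize


def tokenize_code_string(text):
    # Same logic as the function above, repeated so this cell is self-contained
    code_tokens = []
    for tok in tokenize(BytesIO(text.encode("utf-8")).readline):
        # Skip whitespace-only tokens (NEWLINE, INDENT, DEDENT, ...) and the
        # leading encoding token that tokenize() always emits
        if tok.string.strip() != "" and tok.string.strip() != "utf-8":
            code_tokens.append(tok.string.strip().lower())
    return code_tokens


sample = "def add(a, b):\n    # Sum two numbers\n    return a + b\n"

tokens = tokenize_code_string(sample)
print(tokens)

# Comments arrive as a single token starting with "#", so they are easy to drop
tokens_without_comments = [t for t in tokens if not t.startswith("#")]
print(tokens_without_comments)
```

Note that `.lower()` is applied to every token, so identifiers and string literals are lowercased too.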

That is all for the pre-processing. Let's use tree-hugger's powerful API in conjunction with those functions to define a dataset from those custom files

In [None]:
# We first create our PythonParser object
from tree_hugger.core import PythonParser

In [None]:
pp = PythonParser(library_loc="/content/my-languages.so")

In [None]:
# We will now define a dict and populate it with the necessary data in a for loop

final_out_put_data = {}

for file_path in check_out_path(Path("example-files")):
  if is_python_file(file_path):
    final_out_put_data[file_path.stem] = None
    # We use a one-line, super convenient tree-hugger API call to get the needed data
    if pp.parse_file(str(file_path)):
      temp_cache = []
      # The following call returns a dict where each key is a name of a function
      # And each value is a tuple, (function_body, function_docstring)
      func_and_docstr = pp.get_all_function_bodies(strip_docstr=True)
      for func_name, (body, docstr) in func_and_docstr.items():
        code_tokens = tokenize_code_string(body)
        # Let's strip out all the internal comments
        final_code_tokens = [t for t in code_tokens if not t.startswith("#")]
        # Take the first line of the docstring, strip whitespace, remove the triple quotes, lowercase it, and split it into tokens
        docstr_tokens = docstr.split("\n")[0].strip().replace('"""', '').replace("'''", "").lower().split()
        temp_cache.append({"code": final_code_tokens, "docstr": docstr_tokens})
      # Let's add the result to the final output
      final_out_put_data[file_path.stem] = temp_cache
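The docstring handling in the loop above takes whatever is on the first line. If a docstring opens with the triple quotes on a line of their own, that first line is empty and the token list comes out empty. A minimal sketch of a slightly more forgiving variant is shown below; `first_docstring_line_tokens` is a hypothetical helper name, not a tree-hugger function, and the example docstrings are made up.

```python
def first_docstring_line_tokens(docstr: str):
    """Return lowercase tokens of the first non-empty docstring line.

    Hypothetical helper: a variation on the one-liner used in the loop above.
    """
    # Remove both kinds of triple quotes before looking for text
    cleaned = docstr.replace('"""', "").replace("'''", "")
    for line in cleaned.splitlines():
        line = line.strip()
        if line:
            return line.lower().split()
    return []


# Summary on the same line as the opening quotes
same_line = '"""Constructs and sends a :class:`Request <Request>`.\n\n    :param method: method for the new request.\n    """'

# Summary on the line after the opening quotes
next_line = '"""\n    Return the sum.\n    """'

print(first_docstring_line_tokens(same_line))
print(first_docstring_line_tokens(next_line))

# For comparison, the original one-liner applied to the second docstring
original = next_line.split("\n")[0].strip().replace('"""', "").replace("'''", "").lower().split()
print(original)
```

Whether the first non-empty line is the right summary depends on the docstring style of your code base, so treat this as a starting point.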

In [None]:
# And we are DONE!

final_out_put_data["api"][0]

{'code': ['def',
  'request',
  '(',
  'method',
  ',',
  'url',
  ')',
  ':',
  'with',
  'sessions',
  '.',
  'session',
  '(',
  ')',
  'as',
  'session',
  ':',
  'return',
  'session',
  '.',
  'request',
  '(',
  'method',
  '=',
  'method',
  ',',
  'url',
  '=',
  'url',
  ',',
  '**',
  'kwargs',
  ')'],
 'docstr': ['constructs',
  'and',
  'sends',
  'a',
  ':class:`request',
  '<request>`.']}

With this code, you can very easily create a dataset out of your own code files and then test the baseline models against it. 
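To feed the result to a baseline, the dictionary usually has to be written to disk first. The sketch below writes one JSON object per line (JSONL). The field names `code_tokens` and `docstring_tokens` are an assumption modeled on CodeSearchNet-style records, and the `file` field and the `write_jsonl` helper are invented for this example, so check the data loader of the baseline you plan to run before relying on them.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(dataset: dict, out_path: Path) -> int:
    """Write one JSON record per function and return the number of records.

    Hypothetical helper; the record layout is an assumption, not a
    format confirmed by CodeXGLUE.
    """
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for file_stem, functions in dataset.items():
            # Files that could not be parsed are stored as None in the dict
            for item in functions or []:
                record = {
                    "file": file_stem,  # assumed extra field, for traceability
                    "code_tokens": item["code"],
                    "docstring_tokens": item["docstr"],
                }
                fh.write(json.dumps(record) + "\n")
                count += 1
    return count


# Toy data with the same shape as final_out_put_data
toy_data = {
    "api": [{"code": ["def", "f", "(", ")", ":", "pass"], "docstr": ["does", "nothing"]}],
    "broken": None,
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "my_dataset.jsonl"
    n_records = write_jsonl(toy_data, out_file)
    lines = out_file.read_text(encoding="utf-8").splitlines()

print(n_records, lines[0])
```

In the notebook you would pass `final_out_put_data` instead of `toy_data` and choose a permanent output path.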


That was easy!

(We are about to release `docly`, a small command-line tool that helps you write function documentation for your Python code. It uses the same parsing technique 😀)