Here are the introduction and pipeline for the token-level code completion task.
Predict the next code token given a context of previous tokens. Models are evaluated by token-level accuracy.
Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool can improve software developers' productivity. We provide code completion evaluation tasks at two granularities -- token level and line level. Here we introduce token-level code completion. The token-level task is analogous to language modeling: models should be able to predict the next token of arbitrary type.
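As a concrete illustration, token-level accuracy is the fraction of positions where the predicted token exactly matches the ground-truth token. The sketch below is not the evaluation script from this repo, just a minimal, self-contained version of the metric with made-up token lists:

```python
def token_level_accuracy(predictions, references):
    """Fraction of positions where the predicted token matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("prediction and reference lengths must match")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical example: the model gets 3 of 4 tokens right.
preds = ["def", "foo", "(", ")"]
refs  = ["def", "bar", "(", ")"]
print(token_level_accuracy(preds, refs))  # 0.75
```

The real evaluation may additionally normalize tokens (e.g. string/number literals) before comparing; check the repo's evaluator for details.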
If you want to do it yourself from scratch, without docker (docker also reproduces these steps), take a look at entrypoint.sh. Do not execute this file directly, it is made to be run by docker, but it should provide enough information for you to reproduce the process locally.
The datasets for py150 and javaCorpus are not created by us; follow these steps to download and preprocess those datasets.
Our datasets can be found here; save them as is into the dataset folder. If you have already downloaded the datasets, these files should override the files in your local folders.
Download these zip files and extract their contents as is into the save directory.
The files below are for the single-language models.
- Pre-trained on Python, fine-tuned on python
- Pre-trained on Python, fine-tuned on JavaScript
- Pre-trained on Python, fine-tuned on TypeScript
- Pre-trained on Java, fine-tuned on JavaScript
- Pre-trained on Java, fine-tuned on TypeScript
- Pre-trained on Python, fine-tuned on JavaScript then TypeScript
- Pre-trained on Python, fine-tuned on JavaScript then TypeScript then Python
Use these steps to run the code in a docker container.
- Install NVidia drivers from here
- Install docker
- Make sure nvidia-container-toolkit is installed
- Any NVIDIA driver issues are left as an exercise for the reader
Build the dataset creator with the following command (this is quick):
docker build -t dataset -f dataset.Dockerfile .
Run the dataset collector with the following command (this is slow, like really slow). If you see lines like the following:
...
data/despawnerer/summarize/summarize/
data/despawnerer/summarize/summarize/__init__.py
data/despawnerer/summarize/summarize/language.py
data/despawnerer/summarize/summarize/summarize.py
data/despawnerer/summarize/setup.py
It's not stuck! It's just running the tokenizers.
docker run --mount type=bind,source="$(pwd)"/dataset,target=/dataset dataset
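To give a rough sense of what "running the tokenizers" means in this step, here is a minimal sketch using Python's standard tokenize module; the dataset pipeline's own tokenizer may differ in details (literal normalization, special tokens, etc.):

```python
import io
import tokenize

# Split a small code snippet into a flat token stream, dropping
# layout-only tokens (newlines, indentation, end-of-file marker).
source = "def add(a, b):\n    return a + b\n"
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT,
                        tokenize.ENDMARKER)
]
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```

Doing this for every file in every repository under data/ is why the collector takes so long.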
Build the training image with the following command, where CUDA_VERSION can be one of cu116, cu113, cu102, or cpu (this is quite slow):
docker build -t token_completion . --build-arg CUDA_VERSION=[CUDA_VERSION]
Run the trainer with the following command, where MAKE_TARGET is a target from the Makefile:
docker run --gpus all --mount type=bind,source="$(pwd)"/dataset,target=/dataset --mount type=bind,source="$(pwd)"/save,target=/save --mount type=bind,source="$(pwd)"/logs,target=/logs token_completion [MAKE_TARGET]
Or run everything we got!
docker run --gpus all --mount type=bind,source="$(pwd)"/dataset,target=/dataset --mount type=bind,source="$(pwd)"/save,target=/save --mount type=bind,source="$(pwd)"/logs,target=/logs --entrypoint bash token_completion [eval-all.sh | run-all.sh]
If everything went well, you can read the results in the last row of the respective log file in the logs folder.
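A quick way to pull that last row from every log at once is tail; the log file name and the "accuracy: ..." line below are hypothetical stand-ins for whatever your run actually produced:

```shell
# Demo setup with a made-up log file; your real logs land here via the
# --mount type=bind,...,target=/logs flag above.
mkdir -p logs
printf 'epoch 1 ...\nepoch 2 ...\naccuracy: 0.7512\n' > logs/demo.log

# Print the last row of each log file in the logs folder.
tail -n 1 logs/*.log
```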