Here are the introduction and pipeline for the token-level code completion task.
Predict the next code token given a context of previous tokens. Models are evaluated by token-level accuracy.
Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool can improve software developers' productivity. We provide code completion evaluation tasks at two granularities -- token level and line level. Here we introduce token-level code completion. The token-level task is analogous to language modeling: models should be able to predict the next token of arbitrary type.
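As a concrete illustration, token-level accuracy is the fraction of positions where the predicted token exactly matches the ground-truth token. The sketch below is not the evaluation script from this repo, just a minimal, self-contained version of the metric with made-up token lists:

```python
def token_level_accuracy(predictions, references):
    """Fraction of positions where the predicted token matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("prediction and reference lengths must match")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical example: the model gets 3 of 4 tokens right.
preds = ["def", "foo", "(", ")"]
refs  = ["def", "bar", "(", ")"]
print(token_level_accuracy(preds, refs))  # 0.75
```

The real evaluation may additionally normalize tokens (e.g. string/number literals) before comparing; check the repo's evaluator for details.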
If you want to do it yourself from scratch, without docker (docker also reproduces these steps), take a look at entrypoint.sh. Do not execute this file directly, it is made to be run by docker, but it should provide enough information for you to reproduce the process locally.
The datasets for py150 and javaCorpus are not created by us; follow these steps to download and preprocess those datasets.
Our datasets can be found here; save them as is into the dataset folder. If you have already downloaded the datasets, these files should override the files in your local folders.
Download these zip files and extract their contents as is into the save directory.
The files below are for the single-language models.
- Pre-trained on Python, fine-tuned on python
- Pre-trained on Python, fine-tuned on JavaScript
- Pre-trained on Python, fine-tuned on TypeScript
- Pre-trained on Java, fine-tuned on JavaScript
- Pre-trained on Java, fine-tuned on TypeScript
- Pre-trained on Python, fine-tuned on JavaScript then TypeScript
- Pre-trained on Python, fine-tuned on JavaScript then TypeScript then Python
Use these steps to run the code in a docker container.
- Install NVidia drivers from here
- Install docker
- Make sure nvidia-container-toolkit is installed
- Any NVIDIA driver issues are left as an exercise for the reader
Build the dataset creator with the following command (this is quick):
docker build -t dataset -f dataset.Dockerfile .
Run the dataset collector with the following command (this is slow, like really slow). If you see lines like the following:
...
data/despawnerer/summarize/summarize/
data/despawnerer/summarize/summarize/__init__.py
data/despawnerer/summarize/summarize/language.py
data/despawnerer/summarize/summarize/summarize.py
data/despawnerer/summarize/setup.py
It's not stuck! It's just running the tokenizers.
docker run --mount type=bind,source="$(pwd)"/dataset,target=/dataset dataset
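To give a rough sense of what "running the tokenizers" means in this step, here is a minimal sketch using Python's standard tokenize module; the dataset pipeline's own tokenizer may differ in details (literal normalization, special tokens, etc.):

```python
import io
import tokenize

# Split a small code snippet into a flat token stream, dropping
# layout-only tokens (newlines, indentation, end-of-file marker).
source = "def add(a, b):\n    return a + b\n"
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT,
                        tokenize.ENDMARKER)
]
print(tokens)
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```

Doing this for every file in every repository under data/ is why the collector takes so long.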
Build the training image with the following command, where CUDA_VERSION can be one of cu116, cu113, cu102, or cpu (this is quite slow):
docker build -t token_completion . --build-arg CUDA_VERSION=[CUDA_VERSION]
Run the trainer with the following command, where MAKE_TARGET is a target from the Makefile:
docker run --gpus all --mount type=bind,source="$(pwd)"/dataset,target=/dataset --mount type=bind,source="$(pwd)"/save,target=/save --mount type=bind,source="$(pwd)"/logs,target=/logs token_completion [MAKE_TARGET]
Or run everything we got!
docker run --gpus all --mount type=bind,source="$(pwd)"/dataset,target=/dataset --mount type=bind,source="$(pwd)"/save,target=/save --mount type=bind,source="$(pwd)"/logs,target=/logs --entrypoint bash token_completion [eval-all.sh | run-all.sh]
If everything went well, you can read the results in the last row of the respective log file in the logs folder.
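A quick way to pull that last row from every log at once is tail; the log file name and the "accuracy: ..." line below are hypothetical stand-ins for whatever your run actually produced:

```shell
# Demo setup with a made-up log file; your real logs land here via the
# --mount type=bind,...,target=/logs flag above.
mkdir -p logs
printf 'epoch 1 ...\nepoch 2 ...\naccuracy: 0.7512\n' > logs/demo.log

# Print the last row of each log file in the logs folder.
tail -n 1 logs/*.log
```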