Datahack 2019 Project
This project automatically generates code descriptions and provides a code search engine.
It is based on the Stack Overflow dataset, focusing on the Data Science and Data Structures fields only.
src
Main source code directory.
resources
External resources necessary for running this project, like BERT's vocabulary txt file.
scripts
Helper scripts, such as:
- SQL query for fetching the Stack Overflow data from Google's BigQuery service.
- Shell scripts for running each step in the process.
- Make sure to have Python 3.6.
- Install pipenv by running pip install pipenv.
- In your terminal, create a new virtual environment inside a new shell, using the command pipenv shell (make sure to run all commands inside this shell so as not to affect your global environment settings). This should create a .venv folder inside the project's root folder.
- Install all the requirements using the Pipfile and Pipfile.lock files by running the following command: pipenv sync. Note: If using a GPU machine (recommended), one needs to change tensorflow to tensorflow-gpu.
- Fetch the data from Google's BigQuery service using the script scripts/bigquery_stackoverflow.sql. Google offers a free trial of $300, which is more than enough for this task.
- Clone Google's BERT code into src/bertcode/bert (the folder is in .gitignore).
- Download BERT's base uncased model for English into bert/models/uncased_L-12_H-768_A-12 (the folder is in .gitignore).
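
The setup steps above can be checked with a small preflight script before running anything. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it assumes that every path mentioned in the steps is relative to the project's root folder.

```python
import sys
from pathlib import Path

# Paths are copied from the setup steps; treating them as relative to the
# project root is an assumption of this sketch.
REQUIRED_PATHS = (
    "Pipfile",
    "Pipfile.lock",
    "scripts/bigquery_stackoverflow.sql",
    "src/bertcode/bert",
    "bert/models/uncased_L-12_H-768_A-12",
)


def is_python_36(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3.6.x."""
    return tuple(version_info[:2]) == (3, 6)


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


def missing_paths(project_root, required=REQUIRED_PATHS):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).exists()]


def preflight(project_root="."):
    """Collect readable problems; an empty list means all checks passed."""
    problems = []
    if not is_python_36():
        problems.append("expected Python 3.6, found %d.%d" % sys.version_info[:2])
    if not in_virtualenv():
        problems.append("not inside a virtual environment (run: pipenv shell)")
    for rel in missing_paths(project_root):
        problems.append("missing: " + rel)
    return problems
```

Calling preflight() from the project's root folder returns a list of problems, which is empty when the interpreter, the virtual environment, and the expected files and folders are all in place.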
Please follow the scripts in the resources folder for all running examples.
Acknowledgements
- Amenity Analytics for the credit and resources. Thanks!
- The main idea is derived from hamelsmu, with some modifications to fit the problem of generating comments from code, mainly in the data, pre-processing, cleaning, and sentence embedding mechanism.