
Code summarization dataset

Tool for mining data from GitHub for code summarization tasks.

Installation

Follow these steps to run the tool:

  • Clone the repo from GitHub
git clone https://github.com/JetBrains-Research/code-summarization-dataset.git

Tool modules

I. Filtration: filters input GitHub repository URLs with the specified filters (config: config/filter_config.json)

II. Analysis: extracts method summaries from a repository or a local directory / file (config: config/analysis_config.json)

III. Provider: filtration + analysis = continuous filtration and analysis of repositories

I. Repositories filtering (filtration module)

Input: a list of URLs to existing GitHub repositories in the format .../REPOOWNER/REPONAME (exactly 2 slashes) and a filtration config

Output: lists of 'good' and 'bad' repositories after applying the filters, with explanations of the filter results

1. Filtration config

The filtration config is a .json file with all repository filters and run parameters:

{
    "token_path" : "config/token.txt",           // path to GitHub token
    "repos_urls_path": "config/repos.json",      // path to repos URLs
    "dump_dir_path" : "config/filter_results",   // dump directory path
    "languages": ["Java", "Python"],             // list of languages
    "stars_count": [">=", 10],                   // relations with integers
    "is_fork": [false],                          // boolean flag
    "is_license": [true],
    "licenses": ["gpl-3.0", "apache-2.0"],       // list of licenses
    "commits_count": [0, 100000],                // integer ranges
    "contributors_count": [">=", 10],
    "anon_contributors": [true],                 
    "watchers_count": [],                        // empty list == no filter
    "forks_count": [10, 100000],
    "open_issues_count": [0, 100000],
    "subscribers_count": [],
    "size_KB": ["<=", 10000000],
    "created_at": [">=","2010-01-01"],           // relations with dates
    "updated_at": ["2010-01-01", "2015-01-01"],  // dates ranges
    "pushed_at": []
}

GitHub token

  • filtration requires a GitHub API personal access token without any special permissions
  • the token is a 40-character string that must be on the first line of a separate file, with no additional data
  • the file at token_path must contain this token
    file: filter_config.json
    
    ...
    "token_path": "token.txt"
    ...
    file: token.txt
    
    496**********************************4cf
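
For illustration, a minimal Kotlin sketch of reading such a token file (loadToken is hypothetical, not part of the tool's API):

import java.io.File

// Reads the GitHub token: the first line of the file, trimmed, with no other data expected.
fun loadToken(tokenPath: String): String {
    val token = File(tokenPath).readLines().first().trim()
    require(token.length == 40) { "a GitHub personal access token is 40 characters" }
    return token
}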

Filtration config rules

  • each parameter must be specified in brackets: [params, ...]
  • dates are in "YYYY-MM-DD" format, with quotes
  • all integer filters support relations (>, <, <=, >=, =) in quotes: ["<", N]
  • all integer filters support ranges in brackets: [min incl., max incl.]
  • all date filters support relations (>, <, <=, >=, =) in quotes: [">=", "YYYY-MM-DD"]
  • all date filters support ranges in brackets: [min incl., max incl.]
  • all date and integer filters support an implicit EQ (=) relation: [N] == ["=", N]
  • licenses:
    • "is_license": [] - no filter: the repository may or may not have a license; values in the "licenses": [...] field are ignored
    • "is_license": [false] - the repository has no license; values in the "licenses": [...] field are ignored
    • "is_license": [true] - the repository has a license
      • "licenses": [] - any license
      • "licenses": [...] - a license from the list
  • license keys are not hardcoded in the tool
  • the licenses filter matches the license/key field (a keyword from GitHub's licenses list) of the GitHub API v3 summary https://api.github.com/repos/REPOOWNER/REPONAME for each repository

Examples

  • "languages": ["Kotlin", "C++", "Haskell"] --> repository main language is Kotlin OR C++ OR Haskell
  • "[param]_count": [42] --> repository [param]_count == 42
  • "[param]_count": ["<=", 42] --> repository [param]_count <= 42
  • "[param]_count": [42, 128] --> 42 <= repository [param]_count <= 128
  • "pushed_at": ["2010-01-01"] --> repository push date = 2010.01.01
  • "created_at": [">=","2010-01-01"] --> repository creation date >= 2010.01.01
  • "updated_at": ["2010-01-01", "2015-01-01"] --> 2010.01.01 <= repository update date <= 2015.01.01

2. Run

  • in the .json file at repos_urls_path, add any GitHub repositories in the format .../REPOOWNER/REPONAME (exactly 2 slashes)
    file: filter_config.json
    
    ...
    "repos_urls_path": "repos.json"
    ...
    file: repos.json
    
    ["/JetBrains/Kotlin", "/JetBrains/intellij-community"]
    or
    ["https://github.com/JetBrains/Kotlin", "https://github.com/JetBrains/intellij-community"]

2.1 as code in project

  • import the FilterConfig and ReposFilter classes
  • provide a path to filter_config.json, initialize the config and the filter
    val filterConfig = FilterConfig(configPath = filterConfigPath, isDebug = isDebug)
    val reposFilter = ReposFilter(config = filterConfig)
    reposFilter.run()

2.2 as separate module

  • write your own entry point

    import filtration.utils.FilterParser
    
    fun main(args: Array<String>) = FilterParser().main(args)
  • run with script and command line arguments

    #!/bin/bash
    ./gradlew :run --args="--mode=filter --fd --fc config/filter_config.json"
  • arguments

    -m, --mode                         - work mode: filter, analysis, filter-analysis
    --fc, --filter-config              - path to filtration config .json file
    --fd, --filter-debug, -d, --debug  - flag, print all log messages to the console
    

3. Results

In dump_dir_path, 4 files and 2 folders appear:

  • folders good (bad), each with an inner folder explain
  • files good(bad)_input_urls.jsonl -- good (bad) input urls
  • files good(bad)_repos.jsonl -- all good (bad) traversed repos

The good (bad) folders contain a summary for each repository.

good(bad)/explain contains an explanation of the results of the filters (from the config file) applied to each repository.
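
Each .jsonl file holds one JSON document per line, so the results can be consumed with plain line-by-line reading; a minimal Kotlin sketch (the dump path depends on your dump_dir_path, and the fields inside each line depend on the tool's dump format):

import java.io.File

// Each line of good_repos.jsonl is a standalone JSON document describing one traversed repository.
fun main() {
    File("config/filter_results/good_repos.jsonl").useLines { lines ->
        lines.forEachIndexed { i, line -> println("repo #$i: $line") }
    }
}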

4. Search data sources

commits_count - 1 GraphQL query:

query {
  repository(owner: "JetBrains-Research", name: "code-summarization-dataset") {
    defaultBranchRef {
      target {
        ... on Commit {
          history (first: 1) {
            totalCount
            pageInfo
            { endCursor }
          }
        }
      }
    }
  }
}

contributors_count - 1 API v3 query with a pagination hack (per_page=1, so the number of pages equals the number of contributors):

https://api.github.com/repos/jetbrains/kotlin/contributors?per_page=1&anon=false

all other fields come from the repository summary api/repos/owner/name - 1 API v3 query:

https://api.github.com/repos/jetbrains/kotlin
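
The hack works because with per_page=1 GitHub returns one contributor per page, so the page number in the rel="last" Link header equals the total contributor count. A minimal Kotlin sketch using java.net.http (contributorsCount is illustrative, not the tool's client):

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// With per_page=1, the page number in the rel="last" Link header equals the contributor count.
fun contributorsCount(owner: String, name: String, token: String): Int {
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.github.com/repos/$owner/$name/contributors?per_page=1&anon=false"))
        .header("Authorization", "token $token")
        .build()
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding())
    val link = response.headers().firstValue("Link").orElse("")
    // Link header looks like: <...&page=2>; rel="next", <...&page=1518>; rel="last"
    return Regex("""[?&]page=(\d+)>; rel="last""").find(link)?.groupValues?.get(1)?.toInt()
        ?: 1 // no Link header: a single page, i.e. at most one contributor
}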

II. Repositories analysis (analysis module)

Currently supported languages: Java, Python

  • for Python data retrieval you need pythonparser in your system PATH

Input: a repository or a directory and an analysis config

Output: a summary of all functions from the repository or directory, for the supported languages:

  • function name and fullname
  • function documentation or multiline comment
  • function body
  • function AST
  • function AST paths in code2seq format
  • metadata (file path, commit info, extraction statistics)

1. Analysis config

The analysis config is a .json file with run parameters:

{
  "HISTORY_MODE": true,                               // main feature of the tool (see below)

  "repos_urls_path": "config/repos.json",             // path to a .json list of urls (in format /{OWNER}/{NAME}) to GitHub repos
  "files_list_path": "config/files.json",             // path to a .json list of paths to local directories or files
  "dump_dir_path" : "config/analysis_results",        // path to the tool dump directory

  "workers_count": 3,                                 // how many workers run in parallel (thread pool size)
  "log_dump_threshold": 200,                          // threshold for dumping log messages to file
  "summary_dump_threshold": 200,                      // threshold for dumping methods summaries

  "gzip_files": true,                                 // whether the extracted data should be gzipped
  "remove_after_gzip": false,                         // whether the non-gzipped extracted data should be deleted after compression
  "remove_repo_after_analysis": false,                // whether the repository should be deleted after analysis

  "commits_type": "merges",                           // commits type (merges or first_parents, see explanation below)
  "min_commits_number": 0,                            // minimum number of commits of the selected type for analysis to start
  "merges_part_in_history": 0.005,                    // part of merge commits in the first_parents history (see below)

  "task": "name",                                     // currently supported task: name extraction
  "parser": "gumtree",                                // currently supported parser: gumtree
  "granularity": "method",                            // currently supported granularity: method
  "languages": ["Java", "Python"],                    // currently supported languages: Java, Python
  "method_uniqueness": ["full_name", "name", "file", "return_type", "args_types"], // how to check method uniqueness

  "hide_methods_names": true,                         // hide method names in method bodies and ASTs
  "exclude_constructors": true,                       // exclude constructors from the summary

  "min_body_lines_length": 0,                         // minimum number of lines in a method body
  "exclude_with_exact_name": ["name1", "name2"],      // methods with these exact names will not be collected
  "exclude_with_name_prefix": ["test"],               // methods whose names start with these prefixes will not be collected

  "JAVA_exclude_with_annotations": ["@Override"],     // Java methods with these annotations will not be collected

  "max_paths": 1000,                                  // upper bound on the number of retrieved paths (code2seq)
  "max_path_width": 2,                                // path max width
  "max_path_length": 9,                               // path max length (number of tokens)
  "exclude_nodes": [],                                // exclude nodes from the AST and code2seq paths
  "exclude_doc_node": true,                           // exclude the documentation node from the AST and code2seq paths
  "ast_dot_format": false,                            // AST dump format: dot or our version (dot with identifiers in nodes)
  "code2sec_format_dump": true                        // dump the AST in code2seq format 'method|name node,PATH,node'
}

History processing

If "HISTORY_MODE": false:

  • data from repositories is extracted without git history
  • data from directories / files is always extracted without git history

If "HISTORY_MODE": true, data extraction is based on git history; the analyzer:

  • loads the commit history from the default branch of the repository
  • moves from the oldest (first) commit to the newest (last)
  • for every consecutive pair of commits, gets the diff list of files: git diff --name-only SHA1 SHA2
  • keeps only the files in supported languages from the diff list
  • if that list isn't empty, checks out the current commit: git checkout SHA
  • extracts summaries of new methods from those files, i.e. methods whose uniqueness tuple "method_uniqueness": [...] has not been seen before (see the sketch after this list); tuple parameters:
    • "name" - method name, e.g. foo, bar
    • "full_name" - method full name (nesting hierarchy: all parent classes and functions), e.g. MyClass.foo, foo.bar
    • "return_type" - method return type
    • "args_types" - types of the method's arguments (where possible to extract)
    • "file" - path to the file containing the method
    • if "method_uniqueness": [] - the tool extracts all methods without a uniqueness check

Two types of history processing, depending on the commit type:

  • "commits_type": "merges" - the history includes merge commits: git log --first-parent --merges DEFAULT_BRANCH

  • "commits_type": "first_parents" - the history includes first-parent commits: git log --first-parent DEFAULT_BRANCH

  • both history types include the oldest and the newest commit (merge or not)

  • "merges_part_in_history": 0.005 - an attempt to distinguish repositories with rebase-based history from merge-based history, e.g. for the Kotlin repository (see the sketch after this list):

    git log --first-parent --pretty=oneline | wc -l gives 66016 first-parent commits

    git log --first-parent --merges --pretty=oneline | wc -l gives 512 merge commits

    merges_part_in_history = 512 / 66016 = 0.00776

    => if we set merges_part_in_history = 0.01, the repository will not be analyzed, because its real value is below the configured one (0.00776 [real value] < 0.01 [value in config])

    *set merges_part_in_history = 0.0 if you do not have enough statistics
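
For reference, a minimal Kotlin sketch that reproduces this computation by shelling out to git (a sketch, not the tool's code; assumes git is installed and the working directory is a repository clone):

import java.io.File

// Counts the commits produced by the given git log arguments.
fun countCommits(repoDir: File, vararg logArgs: String): Int =
    ProcessBuilder("git", "log", "--pretty=oneline", *logArgs)
        .directory(repoDir)
        .start()
        .inputStream.bufferedReader().readLines().size

fun main() {
    val repo = File(".")
    val firstParents = countCommits(repo, "--first-parent")
    val merges = countCommits(repo, "--first-parent", "--merges")
    // e.g. for Kotlin: 512 / 66016 = 0.00776
    println("merges_part_in_history = ${merges.toDouble() / firstParents}")
}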

2. Run

2.1 as code in project

  • import the AnalysisConfig, Analyzer and AnalysisRepository classes

  • provide a path to analysis_config.json, initialize the config and the analyzer

  • submitRepo (or submitRepos) any number of repositories for analysis

  • submitFile (or submitFiles) any number of directories / files for analysis

    // config for analysis
    val analysisConfig = AnalysisConfig(
        configPath = analysisConfigPath,
        isDebugAnalyzer = isAnalyserDebug, // log to console only submission and worker completion statuses
        isDebugWorkers = isWorkersDebug    // log to console all worker messages
    )
    val reposUrls = loadJSONList(analysisConfig.reposUrlsPath).parseRepoUrls()
    val filesPaths = loadPaths(analysisConfig.filesListPaths)
    
    val reposAnalyzer = Analyzer(config = analysisConfig)
    
    // submit a repo that is not yet cloned
    reposAnalyzer.submitRepo(AnalysisRepository(owner = "owner", name = "name"))
    // submit an already cloned repo
    reposAnalyzer.submitRepo(AnalysisRepository(path = "repo/path", owner = "owner", name = "name"))
    // submit repos from a list
    reposAnalyzer.submitRepos(reposUrls.map { AnalysisRepository(owner = it.first, name = it.second) })
    // submit a file
    reposAnalyzer.submitFile("my/file.java")
    // submit a directory
    reposAnalyzer.submitFile("my/python/files")
    // submit any files or directories
    reposAnalyzer.submitFiles(filesPaths)
    
    // blocking wait while any worker is still running
    reposAnalyzer.waitUnitAnyRunning()

2.2 as separate module

  • provide config paths

    file: analysis_config.json
    
    ...
    "repos_urls_path": "repos.json"
    "files_list_path": "files.json"
    ...
    • repos_urls_path - path to a .json list of GitHub repositories in the format .../REPOOWNER/REPONAME (exactly 2 slashes)

      file: repos.json
      
      ["/JetBrains/Kotlin", "/JetBrains/intellij-community"]
      or
      ["https://github.com/JetBrains/Kotlin", "https://github.com/JetBrains/intellij-community"]
    • files_list_path - path to a .json list of local directory or file paths

      file: files.json
      
      ["local/directory", "local/SomeClass.java"]
  • write your own entry point

    import analysis.utils.AnalyzerParser
    
    fun main(args: Array<String>) = AnalyzerParser().main(args)
  • run with script and command line arguments

    #!/bin/bash
    ./gradlew :run --args="--mode=analysis --ad --ac config/analysis_config.json"
  • arguments

    -m, --mode               - work mode: filter, analysis, filter-analysis
    --ac, --analysis-config  - path to analysis config .json file
    --ad, --analysis-debug   - flag, print all log messages to the console
    --wd, --workers-debug    - flag, print all workers messages to the console
    -d, --debug              - flag, print all log messages to the console
    

3. Results

In dump_dir_path, the following files appear:

  • info.json -- repository / file / dir analysis summary
  • methods.jsonl -- all methods summary (one method per line)
  • paths.jsonl -- all retrieved paths for every method (one method per line)
  • paths.c2s -- all retrieved paths for every method in code2seq file format
  • commits_log.jsonl -- all consecutive traversed commits pairs (one pair per line)
  • work_log.txt -- log file
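
Like the filtration output, methods.jsonl is one JSON document per line (gzipped when "gzip_files": true); a minimal Kotlin sketch for consuming it (the path is a hypothetical placeholder for one repository's dump):

import java.io.File
import java.util.zip.GZIPInputStream

// Counts the methods in a methods.jsonl dump; handles the gzipped variant from "gzip_files": true.
fun main() {
    val file = File("path/to/methods.jsonl") // hypothetical path to one repository's dump
    val stream = if (file.extension == "gz") GZIPInputStream(file.inputStream()) else file.inputStream()
    stream.bufferedReader().useLines { lines ->
        println("extracted methods: ${lines.count()}")
    }
}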

III. Repositories filtering and analysis (module provider)

Filtration + analysis modules

Input: a list of URLs to existing GitHub repositories, plus the filtration and analysis configs

Output:

  1. lists of 'good' and 'bad' repositories after applying the filters, with explanations of the filter results
  2. for each 'good' repository, all summary information extracted by the analysis module

1. Run

  • prepare two .json configs: filter_config.json and analysis_config.json
  • prepare a list of GitHub repositories at repos_urls_path in filter_config.json (see I, 2)
  • the repository and path lists referenced by repos_urls_path and files_list_path in analysis_config.json will be ignored

Workflow

  • the tool takes a repository from the file given in repos_urls_path in filter_config.json
  • the tool applies the filters from filter_config.json to the repository
  • if all filters pass:
    • the repository url is sent to the analysis module
    • the analysis module downloads the repository and retrieves all necessary data with the parameters from analysis_config.json
    • for each repository the tool stores all data in a repository folder in dump_dir_path from analysis_config.json

1.1 as code in project

  • import the FilterConfig, AnalysisConfig and FilterAnalyserProvider classes

  • provide paths to filter_config.json and analysis_config.json, initialize the configs and the provider

    val filterConfig = FilterConfig(configPath = filtrationConfigPath, isDebug = isSearchDebug)
    val analysisConfig = AnalysisConfig(
        configPath = analysisConfigPath,
        isDebugAnalyzer = isAnalyserDebug,
        isDebugWorkers = isWorkersDebug
    )
    val provider = FilterAnalyserProvider(filterConfig = filterConfig, analysisConfig = analysisConfig)
    provider.run()

1.2 as separate module

  • write your own entry point

    import provider.utils.ProviderParser
    
    fun main(args: Array<String>) = ProviderParser().main(args)
  • run with script and command line arguments:

    #!/bin/bash
    ./gradlew :run --args="--mode=filter-analysis --fd --ad --fc config/filter_config.json --ac config/analysis_config.json"
  • arguments:

    -m, --mode               - work mode: filter, analysis, filter-analysis
    --fc, --filter-config    - path to filtration config .json file
    --ac, --analysis-config  - path to analysis config .json file
    --fd, --filter-debug     - flag, print all filtration module log messages to the console
    --ad, --analysis-debug   - flag, print all analysis module log messages to the console
    --wd, --workers-debug    - flag, print all workers messages to the console
    -d, --debug              - flag, print all log messages to the console
    

2. Results

  • output from the filtration module
  • for each repository, output from the analysis module in the folder dump_folder/data/REPOOWNER__REPONAME