VulCoCo is a tool for vulnerable code clone detection that combines retrieval-based methods with LLM validation to identify code clones in software repositories.
- Python 3.8+
- Conda package manager
- Anthropic API key (for LLM validation step)
Setup Environment
    conda env create -f environment.yml
    conda activate vulcoco
Download Source Dataset
Download the source dataset from Google Drive and extract it to your preferred location.
    python get_top_repos.py
This script fetches and clones the top repositories for analysis.
    python parse_repos.py
Parses the cloned repositories to extract function-level code segments.
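To make "function-level code segments" concrete, the sketch below extracts one record per function from a source string. It is an illustration only, not the logic of parse_repos.py: it assumes Python source files and uses the standard `ast` module, whereas the real script may target other languages and a different parser. The helper `extract_functions` and the record fields (`file`, `name`, `start_line`, `end_line`, `code`) are hypothetical names chosen for this example.

```python
import ast
import json
from typing import Dict, List


def extract_functions(source: str, filename: str = "<unknown>") -> List[Dict]:
    """Return one record per function definition found in `source`.

    Hypothetical helper: the record layout is an assumption, not the
    format that parse_repos.py actually writes.
    """
    tree = ast.parse(source, filename=filename)
    records = []
    for node in ast.walk(tree):
        # Covers plain and async functions, including methods inside classes.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "file": filename,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    # ast.walk gives no ordering guarantee we want to rely on, so sort by position.
    records.sort(key=lambda record: record["start_line"])
    return records


example = '''
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hi " + name
'''

functions = extract_functions(example, "example.py")
for record in functions:
    # One JSON object per function, similar in spirit to a function JSON file.
    print(json.dumps(record))
```

Keeping the start and end lines alongside the code makes it possible to trace a detected clone back to its location in the cloned repository.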
    python3 main.py --all_json_path 'path/to/source/data.jsonl' \
        --funcs_dir 'path/to/function/json/files' \
        --clones_dir 'path/to/output/directory' \
        --threshold 0.7
Parameters:
- --all_json_path: Path to the JSONL source dataset
- --funcs_dir: Directory containing function JSON files from Step 2
- --clones_dir: Output directory for clone detection results
- --threshold: Similarity threshold for clone detection (default: 0.7)
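The role of the similarity threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the retrieval method used by main.py: it assumes a bag-of-tokens representation and cosine similarity, whereas the tool may use embeddings or another measure entirely. The functions `tokenize`, `cosine`, and `find_candidates` are hypothetical names; only the 0.7 default comes from the parameter description above.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(code: str) -> Counter:
    """Bag of tokens: identifiers, numbers, and single punctuation characters."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_candidates(
    sources: Dict[str, str],
    targets: Dict[str, str],
    threshold: float = 0.7,  # same default as --threshold
) -> List[Tuple[str, str, float]]:
    """Return (source_id, target_id, score) for pairs scoring at or above threshold."""
    pairs = []
    for source_id, source_code in sources.items():
        source_vec = tokenize(source_code)
        for target_id, target_code in targets.items():
            score = cosine(source_vec, tokenize(target_code))
            if score >= threshold:
                pairs.append((source_id, target_id, round(score, 3)))
    # Highest-scoring candidates first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


sources = {"vuln_copy": "void copy(char *dst, char *src) { strcpy(dst, src); }"}
targets = {
    "renamed": "void copy(char *out, char *in) { strcpy(out, in); }",
    "unrelated": "int add(int a, int b) { return a + b; }",
}

candidates = find_candidates(sources, targets, threshold=0.7)
print(candidates)
```

In this toy setup, raising the threshold trades recall for precision: fewer pairs survive retrieval, so fewer reach the LLM validation step.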
    python3 llm.py --results 'path/to/clone/results.json' \
        --sources 'path/to/source/data.jsonl' \
        --api-key 'your-anthropic-api-key' \
        --output 'path/to/validated/output.json' \
        --responses-dir 'path/to/llm/responses'
Parameters:
- --results: JSON file containing clone detection results from Step 3
- --sources: Path to the original JSONL source dataset
- --api-key: Your Anthropic API key for LLM validation
- --output: Output path for validated results
- --responses-dir: Directory to save raw LLM responses
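The validation step can be pictured as a filter over candidate pairs: each pair is turned into a prompt, the model's reply is parsed into a verdict, and only confirmed pairs are kept. The sketch below shows that flow without calling any API. Everything in it is an assumption made for illustration: the prompt wording, the YES/NO reply format, the pair fields (`source`, `candidate`), and the `ask` callable, which in a real run would wrap a request to the Anthropic API. It does not reproduce what llm.py does.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the wording used by llm.py is not documented here.
PROMPT_TEMPLATE = (
    "You are reviewing a possible vulnerable code clone.\n"
    "Known vulnerable function:\n{source}\n\n"
    "Candidate function:\n{candidate}\n\n"
    "Answer on the first line with exactly YES or NO, then explain briefly."
)


def build_prompt(source: str, candidate: str) -> str:
    return PROMPT_TEMPLATE.format(source=source, candidate=candidate)


def parse_verdict(response: str) -> bool:
    """Treat a reply whose first line starts with YES as a confirmed clone."""
    stripped = response.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line.strip().upper().startswith("YES")


def validate_pairs(
    pairs: List[Dict[str, str]],
    ask: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only the pairs the model confirms, attaching its raw reply."""
    validated = []
    for pair in pairs:
        response = ask(build_prompt(pair["source"], pair["candidate"]))
        if parse_verdict(response):
            validated.append({**pair, "llm_response": response})
    return validated


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model call: confirms candidates that call strcpy."""
    candidate = prompt.split("Candidate function:\n", 1)[1]
    if "strcpy" in candidate:
        return "YES\nBoth functions call strcpy without a bounds check."
    return "NO\nThe candidate implements different logic."


vulnerable = "void copy(char *dst, char *src) { strcpy(dst, src); }"
pairs = [
    {"source": vulnerable, "candidate": "void dup(char *out, char *in) { strcpy(out, in); }"},
    {"source": vulnerable, "candidate": "int add(int a, int b) { return a + b; }"},
]

validated = validate_pairs(pairs, fake_llm)
print(len(validated), "of", len(pairs), "pairs confirmed")
```

Passing the model call in as a function keeps the parsing and filtering logic testable offline, and storing the raw reply next to each confirmed pair mirrors the idea behind --responses-dir.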
The tool generates:
- Clone detection results in JSON format
- LLM validation responses
- Final validated clone pairs with confidence scores