
secure-software-engineering/TypeEvalPy



A Micro-benchmarking Framework for Python Type Inference Tools

📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark.
  • 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚢 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format.
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
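The core of the benchmark is comparing a tool's inferred types against the ground-truth annotations once both are in the standardized format. As a rough illustration, here is a minimal sketch of such an exact-match count; the schema used below (`file`/`line_number`/`function`/`type` fields) is a simplified, hypothetical stand-in, not TypeEvalPy's actual JSON format.

```python
# Sketch of an exact-match comparison over a standardized annotation
# format. The schema is hypothetical and simplified for illustration.

def count_exact_matches(ground_truth, inferred):
    """Count ground-truth annotations reproduced exactly by a tool."""
    # Index inferred types by their source location.
    inferred_index = {
        (e["file"], e["line_number"], e["function"]): e["type"]
        for e in inferred
    }
    matches = 0
    for gt in ground_truth:
        key = (gt["file"], gt["line_number"], gt["function"])
        if inferred_index.get(key) == gt["type"]:
            matches += 1
    return matches

ground_truth = [
    {"file": "main.py", "line_number": 3, "function": "add", "type": ["int"]},
    {"file": "main.py", "line_number": 7, "function": "greet", "type": ["str"]},
]
inferred = [
    {"file": "main.py", "line_number": 3, "function": "add", "type": ["int"]},
    {"file": "main.py", "line_number": 7, "function": "greet", "type": ["Any"]},
]
print(count_exact_matches(ground_truth, inferred))  # 1
```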

🛠️ Supported Tools

| Supported ✅ | In-progress 🔧 | Planned 💡 |
| ----------- | -------------- | ---------- |
| HeaderGen   | Intellij PSI   | MonkeyType |
| Jedi        | Pyre           | Pyannotate |
| Pyright     | PySonar2       |            |
| HiTyper     | Pytype         |            |
| Scalpel     | TypeT5         |            |
| Type4Py     |                |            |
| GPT-4       |                |            |
| Ollama      |                |            |


🏆 TypeEvalPy Leaderboard

Below is a comparison showcasing exact matches across different tools, coupled with top_n predictions for ML-based tools.

| Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | ------- | ----- | -------------------- | ----------------------- | ------------------- | ----- |
| 1 | HeaderGen | 1 | 186 | 56 | 322 | 564 |
| 2 | Jedi | 1 | 122 | 0 | 293 | 415 |
| 3 | Pyright | 1 | 100 | 8 | 297 | 405 |
| 4 | HiTyper | 1<br>3<br>5 | 163<br>173<br>175 | 27<br>37<br>37 | 179<br>225<br>229 | 369<br>435<br>441 |
| 5 | HiTyper (static) | 1 | 141 | 7 | 102 | 250 |
| 6 | Scalpel | 1 | 155 | 32 | 6 | 193 |
| 7 | Type4Py | 1<br>3<br>5 | 39<br>103<br>109 | 19<br>31<br>31 | 99<br>167<br>174 | 157<br>301<br>314 |

(Auto-generated based on the analysis run on 20 Oct 2023)
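The top_n column reflects that ML-based tools return a ranked list of candidate types per slot rather than a single answer: a slot counts as a top-n match if the ground-truth type appears among the first n predictions, with top-1 being the strict exact match used for static tools. A hypothetical sketch of this counting (the slot keys and helper are illustrative, not TypeEvalPy's API):

```python
# Hypothetical sketch of top-n exact-match counting for ML-based tools,
# which return a ranked list of candidate types per annotation slot.

def count_top_n_matches(ground_truth, predictions, n):
    """Count slots whose ground-truth type is in the top n predictions."""
    matches = 0
    for slot, true_type in ground_truth.items():
        ranked = predictions.get(slot, [])
        if true_type in ranked[:n]:
            matches += 1
    return matches

ground_truth = {"f:return": "int", "g:param:x": "str"}
predictions = {
    "f:return": ["float", "int", "complex"],  # correct at rank 2
    "g:param:x": ["str"],                     # correct at rank 1
}
print(count_top_n_matches(ground_truth, predictions, 1))  # 1
print(count_top_n_matches(ground_truth, predictions, 3))  # 2
```

This is why the top-3 and top-5 totals for HiTyper and Type4Py in the table are monotonically non-decreasing in n.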


🏆🤖 TypeEvalPy LLM Leaderboard

Below is a comparison showcasing exact matches for LLMs.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | ------- | -------------------- | ----------------------- | ------------------- | ----- |
| 1 | GPT-4 | 225 | 85 | 465 | 775 |
| 2 | Finetuned:GPT 3.5 | 209 | 85 | 436 | 730 |
| 3 | codellama:13b-instruct | 199 | 75 | 425 | 699 |
| 4 | GPT 3.5 Turbo | 188 | 73 | 429 | 690 |
| 5 | codellama:34b-instruct | 190 | 52 | 425 | 667 |
| 6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
| 7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
| 8 | dolphin-mistral | 184 | 76 | 356 | 616 |
| 9 | codebooga | 186 | 56 | 354 | 596 |
| 10 | llama2:70b | 168 | 55 | 342 | 565 |
| 11 | HeaderGen | 186 | 56 | 321 | 563 |
| 12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
| 13 | llama2:13b | 153 | 40 | 283 | 476 |
| 14 | mistral:instruct | 155 | 45 | 250 | 450 |
| 15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
| 16 | vicuna:13b | 153 | 35 | 260 | 448 |
| 17 | vicuna:33b | 133 | 29 | 267 | 429 |
| 18 | Jedi | 122 | 0 | 293 | 415 |
| 19 | Pyright | 100 | 8 | 297 | 405 |
| 19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
| 20 | llama2:7b | 140 | 34 | 216 | 390 |
| 21 | HiTyper | 163 | 27 | 179 | 369 |
| 22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
| 23 | orca2:7b | 117 | 27 | 184 | 328 |
| 24 | vicuna:7b | 131 | 17 | 172 | 320 |
| 25 | orca2:13b | 113 | 19 | 166 | 298 |
| 26 | Scalpel | 155 | 32 | 6 | 193 |
| 27 | Type4Py | 39 | 19 | 99 | 157 |
| 28 | tinyllama | 3 | 0 | 23 | 26 |
| 29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
| 30 | codellama:13b-python | 0 | 0 | 0 | 0 |
| 31 | codellama:34b-python | 0 | 0 | 0 | 0 |
| 32 | codellama:7b-python | 0 | 0 | 0 | 0 |

(Auto-generated based on the analysis run on 14 Jan 2024)


🐳 Running with Docker

1️⃣ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2️⃣ Build Docker image

docker build -t typeevalpy .

3️⃣ Run TypeEvalPy

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy

🕒 Takes about 30 minutes on the first run to build the Docker containers.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder is timestamped, allowing you to easily track and compare different runs.

Correlation of the generated CSV files to the tables in the ICSE paper:

  • Table 1 in the paper is derived from three auto-generated CSV tables:

    • paper_table_1.csv - exact matches by type category.
    • paper_table_2.csv - exact matches for the 18 micro-benchmark categories.
    • paper_table_3.csv - sound and complete values for tools.
  • Table 2 in the paper is based on the following CSV table:

    • paper_table_5.csv - exact matches with top_n values for machine-learning tools.

Additionally, two CSV tables are not included in the paper:

  • paper_table_4.csv - sound and complete values for the 18 micro-benchmark categories.
  • paper_table_6.csv - sensitivity analysis.

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

  • Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In the config.yaml, configure the following:

  • openai_key: your key for accessing OpenAI's models.
  • ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
  • prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
  • ollama_models: select a list of model tags from the Ollama library. For better operation, ensure the model is pre-downloaded with the ollama pull command.
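Putting the keys above together, a minimal config.yaml might look like the following sketch; the key names come from the list above, but all values are placeholders that you must replace with your own.

```yaml
# Sketch of a config.yaml based on the keys described above;
# values are placeholders, not working credentials or endpoints.
openai_key: "sk-..."                  # only needed for OpenAI models
ollama_url: "http://localhost:11434"  # default port of the Ollama Docker container
prompt_id: "questions_based_2"        # recommended setting per the docs
ollama_models:
  - codellama:13b-instruct            # pre-download with: ollama pull codellama:13b-instruct
  - mistral:instruct
```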

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama

Running From Source...

1. 📥 Installation

  1. Clone the repo

    git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  2. Install Dependencies and Set Up Virtual Environment

    Run the following commands to create and activate a virtual environment, then install the dependencies.

    python3 -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt

2. 🚀 Usage: Running the Analysis

  1. Navigate to the src Directory

    cd src
  2. Execute the Analyzer

    Run the following command to start the benchmarking process on all tools:

    python main_runner.py

    or

    Run analysis on specific tools

    python main_runner.py --runners headergen scalpel
    

🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review and merge your submission.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!