This program implements an AI software engineering loop that:
- Reads a software specification prompt
- Generates a Python implementation that satisfies the spec
- Identifies required libraries from the implementation and installs them in a virtual environment
- Creates and runs tests for the implementation
- Analyzes test results
- Iteratively improves the implementation until all tests pass
The program uses Ollama with configurable models to generate code and maintains a memory file to track learnings across iterations. It includes a model evaluator that can test multiple Ollama models to compare their performance.
- Python 3.7+
- Ollama installed and running locally
- Langfuse running locally
- uv (Python package installer and environment manager)
- Clone this repository.
- Install the required dependencies:
  ```
  pip install openai langfuse python-dotenv
  ```
- Install uv:
  ```
  pip install uv
  ```
- Make sure Ollama is running with the `deepseek-coder` model:
  ```
  ollama pull deepseek-coder
  ollama run deepseek-coder
  ```
- Set up Langfuse environment variables in a `.env` file:
  ```
  LANGFUSE_SECRET_KEY=your_secret_key
  LANGFUSE_PUBLIC_KEY=your_public_key
  ```
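In the real program, python-dotenv's `load_dotenv()` reads this file and the Langfuse SDK picks up the keys from the environment. As a minimal illustration of what that loading step does, here is a stdlib-only sketch of a `.env` parser (the helper name `load_env` is ours, not the program's):

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: put KEY=value lines into os.environ.
    Illustrative sketch; the actual program uses python-dotenv's load_dotenv()."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't overwrite variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip())
```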
You can provide the software specification prompt in three ways:
- As a command-line argument:
  ```
  python ai_engineer.py "Create a function that calculates the factorial of a number"
  ```
- As a file containing the prompt:
  ```
  python ai_engineer.py prompt.txt
  ```
- Via standard input:
  ```
  python ai_engineer.py
  # Then type or paste your prompt and press Ctrl+D when finished
  ```
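The three input modes above can be resolved with a small dispatch function. This is a sketch, not the program's actual code; the helper name `read_prompt` is assumed:

```python
import os
import sys

def read_prompt(argv):
    """Resolve the spec prompt: a file path argument, a literal prompt
    argument, or standard input (sketch of the dispatch described above)."""
    if len(argv) > 1:
        arg = argv[1]
        if os.path.isfile(arg):       # a file containing the prompt
            with open(arg) as f:
                return f.read()
        return arg                    # the prompt itself, passed inline
    return sys.stdin.read()           # typed or piped, ends at Ctrl+D
```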
1. The program reads the software specification prompt.
2. It uses Ollama with the configured model to generate an initial implementation with tests.
3. It runs the tests and checks if they pass.
4. If tests fail, it updates a memory file with learnings from the current iteration.
5. It generates an improved implementation based on the original prompt, test results, and memory.
6. Steps 3-5 repeat until all tests pass or the maximum number of iterations is reached.
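The loop described above can be sketched as follows. The helper functions (`generate_implementation`, `run_tests`, `update_memory`) are hypothetical stand-ins for the program's actual Ollama calls, test runner, and memory-file updates:

```python
MAX_ITERATIONS = 10  # documented default; prevents infinite loops

def engineering_loop(prompt, generate_implementation, run_tests, update_memory):
    """Sketch of the iterate-until-green loop; helpers are stand-ins."""
    memory = {}
    implementation = generate_implementation(prompt, None, memory)
    for iteration in range(MAX_ITERATIONS):
        results = run_tests(implementation)                   # step 3
        if results["passed"]:
            return implementation                             # all tests pass
        memory = update_memory(memory, iteration, results)    # step 4
        implementation = generate_implementation(prompt, results, memory)  # step 5
    return implementation  # best effort after the iteration cap
```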
- The evaluator identifies all available Ollama models.
- For each model, it runs multiple evaluations (default: 5 runs per model).
- Each run allows multiple iterations (default: 3 iterations per run).
- Results are organized in a directory structure:
  ```
  model_evaluations/
  ├── model_name_1/
  │   ├── 001/
  │   │   ├── implementation.py
  │   │   ├── memory.json
  │   │   └── output.log
  │   ├── 002/
  │   └── ...
  ├── model_name_2/
  └── ...
  ```
- A global results file (`model_evaluation_results.json`) tracks the performance of each model.
- The evaluator can be interrupted and resumed at any point.
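A sketch of the evaluator's bookkeeping, showing the per-run directory layout and how skipping existing run directories enables resuming. `run_single_evaluation` is a hypothetical stand-in for one full evaluation run; the real program's internals may differ:

```python
import json
import os

RESULTS_FILE = "model_evaluation_results.json"

def evaluate_models(models, run_single_evaluation,
                    base_dir="model_evaluations", runs_per_model=5):
    """Sketch: evaluate each model over numbered run directories (001/, 002/, ...)."""
    results = {}
    if os.path.exists(RESULTS_FILE):          # resume from a prior session
        with open(RESULTS_FILE) as f:
            results = json.load(f)
    for model in models:
        for run in range(1, runs_per_model + 1):
            run_dir = os.path.join(base_dir, model, f"{run:03d}")
            if os.path.isdir(run_dir):        # already done: skip on resume
                continue
            os.makedirs(run_dir)
            results.setdefault(model, []).append(
                run_single_evaluation(model, run_dir))
            with open(RESULTS_FILE, "w") as f:  # persist after every run
                json.dump(results, f, indent=2)
    return results
```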
- `implementation.py`: The generated Python implementation
- `memory.json`: A JSON file containing learnings from each iteration
- The program is limited to a maximum of 10 iterations to prevent infinite loops.
- The implementation is limited to a single Python file.
- Test execution has a timeout of 30 seconds to prevent hanging.
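The 30-second timeout can be enforced by running the tests in a subprocess, roughly as below. This is a sketch under the assumption that tests are executed as a separate Python process; the function name `run_tests` is ours:

```python
import subprocess
import sys

TEST_TIMEOUT = 30  # seconds; a hung test run is killed rather than stalling the loop

def run_tests(test_file="implementation.py"):
    """Run the generated file in a subprocess with a hard timeout (sketch)."""
    try:
        proc = subprocess.run(
            [sys.executable, test_file],
            capture_output=True, text=True, timeout=TEST_TIMEOUT)
        return {"passed": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"passed": False, "stdout": "", "stderr": "test run timed out"}
```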
You can modify the following constants in the script:
- `MODEL`: The Ollama model to use (default: `"deepseek-coder"`)
- `MEMORY_FILE`: The file to store memory (default: `"memory.json"`)
- `IMPLEMENTATION_FILE`: The file to store the implementation (default: `"implementation.py"`)
- `MAX_ITERATIONS`: Maximum number of iterations (default: `10`)
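For reference, the constants block might look like this near the top of the script (names and defaults as documented here; the surrounding code is not shown):

```python
# Configurable constants (documented defaults)
MODEL = "deepseek-coder"              # Ollama model to use
MEMORY_FILE = "memory.json"           # where cross-iteration learnings are stored
IMPLEMENTATION_FILE = "implementation.py"  # where the generated code is written
MAX_ITERATIONS = 10                   # hard cap on improvement iterations
```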