We can't reliably tell whether code works without executing it, but running untrusted and quite-possibly-buggy code generated by a random LLM directly on our development machines is undesirable. The sandbox therefore wraps execution in three layers:
- Network and filesystem isolation, provided by Docker
- Timeout, provided by a bash exec wrapper
- Eval wrapper to capture errors, format outputs and always return a consistent result

`sandbox.py` (intended to be used as a library)
- `extract_function_info(language, code)`: performs static analysis on potentially non-working code that implements a function. Returns a `{ name, args[] }` object containing the function's name and a list of its arguments.
- `FunctionSandbox(code, language)`: high-level Docker sandbox class. Use the `.call(..)` method to invoke the untrusted function.
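Static analysis can still succeed on code that does not run. Below is a minimal Python sketch of an `extract_function_info`-style helper; it is not the implementation in `sandbox.py`. The regular-expression approach, the `FUNCTION_PATTERNS` table and the dict return shape are assumptions. Unlike Python's `ast.parse`, a regex does not require the code to be syntactically valid.

```python
import re

# Hypothetical per-language patterns. A regex, unlike a real parser,
# still matches the signature when the function body does not compile.
FUNCTION_PATTERNS = {
    "python": re.compile(r"def\s+(\w+)\s*\(([^)]*)\)"),
    "javascript": re.compile(r"function\s+(\w+)\s*\(([^)]*)\)"),
}


def extract_function_info(language, code):
    """Return {"name": ..., "args": [...]} for the first function found, or None."""
    match = FUNCTION_PATTERNS[language].search(code)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2)
    args = []
    for arg in raw_args.split(","):
        # Keep only the bare parameter name: drop defaults and type annotations.
        arg = arg.split("=")[0].split(":")[0].strip()
        if arg:
            args.append(arg)
    return {"name": name, "args": args}


if __name__ == "__main__":
    # The body "return a +" is a syntax error, but the signature is still recovered.
    print(extract_function_info("python", "def add(a, b=1):\n    return a +"))
    # {'name': 'add', 'args': ['a', 'b']}
    print(extract_function_info("javascript", "function greet(name) { return 'hi ' + name }"))
    # {'name': 'greet', 'args': ['name']}
```

The trade-off is fragility: this sketch does not handle parentheses or commas inside default values or annotations (for example `x: Dict[str, int]`).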

`timeout.sh`: Bash implementation of the timeout layer.
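`timeout.sh` implements this layer in Bash. To keep all examples in one language, the sketch below substitutes the Python standard library's `subprocess.run(..., timeout=...)` to show the same idea: kill the child if it runs too long, and report the timeout instead of hanging. `run_with_timeout` and its return tuple are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, stdout, timed_out).

    Hypothetical Python stand-in for timeout.sh: subprocess.run kills the
    child process and raises TimeoutExpired once timeout_s seconds pass.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout, False
    except subprocess.TimeoutExpired:
        # Only the direct child is killed; a command that spawns its own
        # subprocesses would need the whole process group terminated.
        return None, "", True


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], 5))
    # (0, 'ok\n', False)
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
    # (None, '', True)
```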

`Dockerfile.javascript`, `Dockerfile.python`: Docker implementations of the isolation layer.
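The exact `docker run` invocation is not described here. The following is an assumed sketch of how an isolation layer can be expressed with standard `docker run` flags: no network, a read-only root filesystem, and a single read-only bind mount for the generated script. `build_docker_command`, the image name `sandbox-python`, and the `/sandbox` path are hypothetical.

```python
import shlex


def build_docker_command(image, host_dir, entrypoint):
    """Return a `docker run` argv that isolates the untrusted code (hypothetical helper)."""
    return [
        "docker", "run",
        "--rm",                           # delete the container when it exits
        "--network", "none",              # no network access
        "--read-only",                    # read-only root filesystem
        "-v", f"{host_dir}:/sandbox:ro",  # expose only the job directory, read-only
        "-w", "/sandbox",                 # run from the mounted directory
        image,
        *entrypoint,
    ]


if __name__ == "__main__":
    cmd = build_docker_command("sandbox-python", "/tmp/job1", ["python", "eval.py"])
    print(shlex.join(cmd))
    # docker run --rm --network none --read-only -v /tmp/job1:/sandbox:ro -w /sandbox sandbox-python python eval.py
```

Returning an argv list rather than a shell string avoids quoting problems, and the list can be passed directly to `subprocess.run` together with a timeout.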

`eval.javascript.tpl`, `eval.python.tpl`: Eval wrapper templates.
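Here is a minimal Python sketch of what an eval wrapper does at runtime: call the function, catch any exception, and always emit JSON with the same keys. The `status`/`result`/`error` schema and the `eval_wrapper` name are assumptions, not the contents of `eval.python.tpl`.

```python
import json


def eval_wrapper(func, args):
    """Call func(*args) and always return a JSON string with the same keys.

    Hypothetical schema: {"status": "OK" or "ERROR", "result": ..., "error": ...}
    """
    try:
        # json.dumps runs inside the try, so a result that cannot be
        # serialized is reported as an ERROR record instead of crashing.
        return json.dumps({"status": "OK", "result": func(*args), "error": None})
    except Exception as exc:
        return json.dumps({
            "status": "ERROR",
            "result": None,
            "error": f"{type(exc).__name__}: {exc}",
        })


if __name__ == "__main__":
    print(eval_wrapper(lambda a, b: a + b, [1, 2]))
    # {"status": "OK", "result": 3, "error": null}
    print(eval_wrapper(lambda a, b: a / b, [1, 0]))  # status ERROR, ZeroDivisionError
```

Because every outcome produces the same three keys, the caller outside the container can parse the output the same way whether the function succeeded, raised, or returned something unserializable.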

The sandbox makes the following assumptions:
- `code` contains a single function in `language`
- the language we are working with supports try/catch, lists and objects
- the language we are working with can serialize JSON
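The JSON assumption has a practical consequence if, as assumed here, arguments and return values cross the container boundary as JSON: only JSON-compatible shapes arrive unchanged. This round-trip sketch uses a hypothetical `json_round_trip` helper to show what changes along the way.

```python
import json


def json_round_trip(value):
    """Encode value to JSON and decode it again (hypothetical helper)."""
    return json.loads(json.dumps(value))


if __name__ == "__main__":
    print(json_round_trip((1, 2)))      # tuples come back as lists: [1, 2]
    print(json_round_trip({1: "one"}))  # non-string keys become strings: {'1': 'one'}
    try:
        json_round_trip({1, 2})         # sets are not JSON-serializable
    except TypeError as exc:
        print("TypeError:", exc)
```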