Skip to content
bjpop edited this page Sep 14, 2010 · 33 revisions

Key implementation ideas

Computational effects

Python statements and expressions are compiled to Haskell expressions with type Eval Object. The Eval type is a monad composed of state, continuations and IO. The state keeps track of the control stack. Continuations provide control flow. IO provides input and output and also mutable values (variables and arrays).

type Eval a = StateT EvalState (ContT Object IO) a

Control flow

Being an imperative language, Python provides a number of control flow operators based on jumps: procedure calls, exception handlers, loops, and generator yields. However, such control flow operators are not (really) primitive in Haskell, so we need to encode them. The continuation monad transformer (ContT) provides first-class continuations and thus a way to implement jumps.

The state contains the control stack, which manages control flow for procedure calls, exception handlers, loops, and generators. The stack itself is essentially a list of frames. Each frame contains one or more continuations which specify potential flows of control. The exact nature of each frame depends on the particular control flow operator that it represents. Certain Python primitives, such as return, raise and yield are compiled to Haskell functions which walk up the stack.

One of the hardest parts of the implementation is making sure the control flow operators work properly together. For example, consider this piece of Python code:

while True:
   try:
      1/0
   except:
      break
   finally:
      continue

It is tricky because the exception handler tries to break, but the finally clause tries to continue. As it happens Python 3.1 rejects the above code because it does not allow continue inside a finally clause. This is one minor case where berp allows something that Python rejects.

Object representation

Objects are represented in berp by one big algebraic data type called, unsurprisingly, Object. It is defined in the module called Berp.Base.SemanticTypes. It has constructors for many special kinds of objects, such as user-defined class instances, types (aka classes), integers, booleans, tuples, lists, dictionaries, functions (aka procedures), strings, generators and the unit type (aka None). There are several more which will be added as berp matures.

We could represent objects in other ways, but a big algebraic data type makes it easy to distinguish different kinds of object very quickly – by pattern matching. This is important for performance. The reason is that we have many overloaded operations in Python, but we can get performance gains by specialising them for certain kinds of objects. For example the plus operator + is overloaded. In the general case its meaning is determined by the __add__ method of a class. However, method lookups are expensive and we want to avoid them where possible. Therefore it is useful for the plus operator to be able to check its arguments to see if they can be specialised.

All Python objects have a unique identity. In CPython the identity happens to be the memory address of the object. This scheme only works if objects do not move about in memory during their lifetime. It also implies that the identity of an object may be reused once the object has been garbage collected. CPython’s approach is expedient, but it does impose some severe restrictions on memory management.

Berp relies on the memory management of GHC, which will move data around during the execution of a program, so we cannot use the memory address for the identity of objects. Instead berp uses a global Integer counter to assign unique identities to objects. The counter is thread safe because it is protected by an MVar. However, we’d like to replace this simple scheme with something more efficient.

Here are a few ideas that might help:

  • Lazy identity creation. We conjecture that many objects never have their identity inspected. If this is true, then it might be better to delay the assignment of identity until the moment it is first needed, if at all. More profiling on real programs is needed to decide if this is really worthwhile.
  • Thread local identity. When we add concurrency to berp, we could give each thread some local state. Part of the state could be a counter for object identity. The full identity of an object would then be a pair (ThreadId, ObjectID). The goal is to avoid threads from competing for the lock on a global MVar.
  • Special case identity. We might be able to treat the identity of certain types of objects in special cases. The obvious example is Integers. Their value makes them unique amongst other Integers. We could represent the identity type with an algebraic data type, which has a constructor for integers, and a constructor for the general case. Perhaps other primitive types could also be treated as special cases.

Variables

Python has mutable variables, or to be more specific, Python variables can be assigned to a particular object, and later re-assigned to another object, and so on. The obvious way to represent a Python variable in Haskell is to use an IORef, and that is what berp does. However, there are situations where ordinary Haskell variables would suffice – and they would be cheaper to use. For example, consider this function:

def euclideanDistance (x1,y1,x2,y2):
    xDelta = x1 - x2
    yDelta = y1 - y2
    return math.sqrt(xDelta * xDelta + yDelta * yDelta)

All of the local variables in euclideanDistance are assigned to an object exactly once – they are never mutated. We could compile them to ordinary Haskell variables, and the resulting code would be more efficient (in both time and space, because each IORef is are allocated on the heap). It should be quite straightforward to analyse the usage of local variables and determine, perhaps conservatively, which ones are assigned at most once.

Status of the implementation

Implemented features

These features are implemented to a reasonable (but perhaps not complete) degree in the development version:

  • Empty statements (pass).
  • Assignment (including multiple assignment of the form a, b = 0, 1).
  • Procedures (def, lambda, function calls, return).
  • Conditional statements (if, elif and else).
  • Loops (for, while, break and continue).
  • Generators (yield).
  • Exceptions (try, except, finally and raise).
  • Classes (class, attribute lookup via the dot operator).
  • Arithmetic and logic operators.
  • Method resolution order (the C3 MRO algorithm).
  • Primitive types: lists, tuples, strings, integers, booleans, dictionaries, generators, none.
  • Interactive I/O (print and input).
  • Subscription.

Unimplemented features

These features are not implemented at all in the current version:

  • Module imports.
  • Keyword and default arguments.
  • Comprehensions.
  • The eval function.
  • Metaclasses.
  • Decorators.
  • Augmented assignment (+= etcetera).
  • File I/O.
  • Variable scope annotations (global and nonlocal).

Translation scheme

As an example, consider this simple Python implementation of the factorial function:

def fac(n, acc):
    if n == 0:
       return acc
    else:
       return fac(n-1, n*acc)

print(fac(1000, 1))

Given the above Python code as input, Berp will produce the following Haskell code as output:

module Main where
import Berp.Base
import qualified Prelude
main = runStmt init
init
  = do _s_fac <- var "fac"
       def _s_fac 2 none
         (\ [_s_n, _s_acc] ->
            ifThenElse
              (do _t_6 <- read _s_n
                  _t_6 == 0)
              (do _t_7 <- read _s_acc
                  ret _t_7)
              (do _t_0 <- read _s_fac
                  _t_1 <- read _s_n
                  _t_2 <- _t_1 - 1
                  _t_3 <- read _s_n
                  _t_4 <- read _s_acc
                  _t_5 <- _t_3 * _t_4
                  tailCall _t_0 [_t_2, _t_5]))
       _t_8 <- read _s_print
       _t_9 <- read _s_fac
       _t_10 <- _t_9 @@ [1000, 1]
       _t_8 @@ [_t_10]

Even without knowing the details of the translation you ought to be able to see that there is a fairly strong correspondence between the input Python code and the output Haskell code. Most of the tricky work on the Haskell side is done by the Berp runtime library which implements all the primitive functions such as def, while etcetera.

There is plenty of room for improvement in the generated code. For example, you can see that some temporary variables are read multiple times redundantly. The focus of the implementation effort so far has been to get berp working rather than to make it go fast. We will attempt to improve the generated code when the rest of the system is more stable.

Issues