# DS-GA-3001 Advanced Python for Data Science

Before you turn this problem in, make sure you **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart). You can then run the cells **in order**, during the class.

Any textual answers that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any code answers that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised, which will indicate to the grader that no answer has been supplied.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

Finally, insert your Net ID and the Net IDs of any collaborators in the cell below.

In [1]:
NET_ID = "jl6583"
COLLABORATORS = ""

---

# An Introduction to Cython

Cython is a modification of Python that adds C data types. Almost any piece of Python code is also valid Cython code (with a few [limitations](http://docs.cython.org/src/userguide/limitations.html#cython-limitations)). Cython then converts the (modified) Python code into C code that makes equivalent calls to the Python/C API. This C code is then compiled into a shared library that can be imported into Python.

In Cython, function parameters and variables can be declared to have C data types, and code which manipulates Python values and C values can be freely intermixed. Cython takes care of converting from C to Python data types automatically wherever possible. Reference count maintenance and error checking of Python operations is also automatic, and the full power of Python’s exception handling facilities, including the try-except and try-finally statements, is still available.

There are two main benefits of Cython:

1. **Speed.** How much performance improves depends very much on the program. Typical Python numerical programs would tend to gain very little as most time is spent in lower-level C anyway. However, for-loop-style programs can improve by many orders of magnitude.
2. **Easy calling into C code.** One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.

The following sections provide a very brief introduction to Cython. See [Cython Language Basics](http://docs.cython.org/src/userguide/language_basics.html#language-basics) for a more detailed description of the Cython language.

## Cython Syntax

### Basic C Types

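Cython supports most C data types. The following table lists the most common types. The sizes shown are typical for 64-bit Linux and macOS; the C standard only guarantees minimum sizes, so they can differ on other platforms.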

<p>
<table align="left">
<tr><td align="center"><b>Type</b></td><td align="center"><b>Description</b></td></tr>
<tr><td>`char`</td><td>8 bit integer</td></tr>
<tr><td>`short`</td><td>16 bit integer</td></tr>
<tr><td>`int`</td><td>32 bit integer</td></tr>
<tr><td>`long`</td><td>64 bit integer</td></tr>
<tr><td>`long long`</td><td>64 bit integer</td></tr>
<tr><td>`float`</td><td>32 bit floating point</td></tr>
<tr><td>`double`</td><td>64 bit floating point</td></tr>
<tr><td>`long double`</td><td>80 bit floating point</td></tr>
<tr><td>`float a[10][30]`</td><td>2-dimensional array</td></tr>
<tr><td>`char *s`</td><td>pointer</td></tr>
<tr><td>`struct foo`</td><td>structure</td></tr>
<tr><td>`union bar`</td><td>union</td></tr>
<tr><td>`enum type`</td><td>enumeration</td></tr>
</table>
</p>
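Because the sizes in the table depend on the platform, it can be worth checking them locally. The following is a minimal sketch that uses the standard-library `ctypes` module (not part of Cython, used here only as a convenient way to ask the local C ABI) to report the size of each basic type. The names `C_TYPES` and `c_type_sizes` are made up for this illustration.

```python
import ctypes

# Map the C type names from the table to their ctypes equivalents.
C_TYPES = [
    ("char", ctypes.c_char),
    ("short", ctypes.c_short),
    ("int", ctypes.c_int),
    ("long", ctypes.c_long),
    ("long long", ctypes.c_longlong),
    ("float", ctypes.c_float),
    ("double", ctypes.c_double),
    ("long double", ctypes.c_longdouble),
]

def c_type_sizes():
    """Return {type name: storage size in bits} for the current platform."""
    return {name: 8 * ctypes.sizeof(ctype) for name, ctype in C_TYPES}

sizes = c_type_sizes()
for name, _ in C_TYPES:
    print("%-12s %3d bits" % (name, sizes[name]))
```

Note that this reports storage size, which may include padding: a `long double` that carries 80 bits of data can still occupy more than 80 bits in memory.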

### Variable and Type Definitions

The `cdef` statement is used to declare C variables, either local or module-level:

```cython
cdef int i, j, k
cdef float f, g[42], *h
```

In C, types can be given names using the `typedef` statement. The equivalent in Cython is `ctypedef`:

```cython
ctypedef int * intPtr
```

Cython also supports C `struct`, `union`, or `enum` types:

<p>
<table align="left">
<tr><td align="center"><b>C code</b></td><td align="center"><b>Cython code</b></td></tr>
<tr><td>
```
struct Grail {
    int age;
    float volume;
}
```
</td><td>
```cython
cdef struct Grail:
    int age
    float volume
```
</td></tr>
<tr><td>
```
union Food {
    char *spam;
    float *eggs;
}
```
</td><td>
```
cdef union Food:
    char *spam
    float *eggs
```
</td></tr>
<tr><td>
```
enum CheeseType {
    cheddar, edam,
    camembert
}
```
</td><td>
```
cdef enum CheeseType:
    cheddar, edam,
    camembert
```
</td></tr>
<tr><td>
```
enum CheeseState {
    hard = 1,
    soft = 2,
    runny = 3
}
```
</td><td>
```
cdef enum CheeseState:
    hard = 1
    soft = 2
    runny = 3
```
</td></tr>
</table>
</p>
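To experiment with these layouts without compiling anything, the same declarations can be mirrored in plain Python. This is only a sketch under the assumption that the standard-library `ctypes` and `enum` modules are an acceptable stand-in; it is not how Cython itself represents these types.

```python
import ctypes
import enum

class Grail(ctypes.Structure):
    # Mirrors: struct Grail { int age; float volume; }
    _fields_ = [("age", ctypes.c_int), ("volume", ctypes.c_float)]

class Food(ctypes.Union):
    # Mirrors: union Food { char *spam; float *eggs; }
    _fields_ = [("spam", ctypes.c_char_p),
                ("eggs", ctypes.POINTER(ctypes.c_float))]

class CheeseState(enum.IntEnum):
    # Mirrors: enum CheeseState { hard = 1, soft = 2, runny = 3 }
    hard = 1
    soft = 2
    runny = 3

grail = Grail(age=42, volume=1.5)
print(grail.age, grail.volume)

# The members of a union share storage, so the union is only as large
# as its largest member (here, a single pointer).
print(ctypes.sizeof(Food) == ctypes.sizeof(ctypes.c_void_p))
print(int(CheeseState.runny))
```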


### Functions

There are two kinds of function definition in Cython:

* **Python functions** are defined using the `def` statement, as in Python. They take Python objects as parameters and return Python objects.

* **C functions** are defined using the new `cdef` statement. They take either Python objects or C values as parameters, and can return either Python objects or C values.

Within a Cython module, Python functions and C functions can call each other freely, but only Python functions can be called from outside the module by interpreted Python code. So, any functions that you want to “export” from your Cython module must be declared as Python functions using `def`. There is also a hybrid function, called `cpdef`. A `cpdef` can be called from anywhere, but uses the faster C calling conventions when being called from other Cython code. A `cpdef` can also be overridden by a Python method on a subclass or an instance attribute, even when called from Cython. If this happens, most performance gains are of course lost and even if it does not, there is a tiny overhead in calling a `cpdef` method from Cython compared to calling a `cdef` method.

Parameters of either type of function can be declared to have C data types, using normal C declaration syntax. For example:

```cython
def spam(int i, char *s):
    ...

cdef int eggs(unsigned long l, float f):
    ...
```

Automatic conversion is currently only possible for numeric types, string types and structs (composed recursively of any of these types); attempting to use any other type for the parameter of a Python function will result in a compile-time error. Care must be taken with strings to ensure a reference if the pointer is to be used after the call. Structs can be obtained from Python mappings, and again care must be taken with string attributes if they are to be used after the function returns.

C functions, on the other hand, can have parameters of any type, since they’re passed in directly using a normal C function call.

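Functions declared using `cdef` with a Python object return type, like Python functions, **return `None` when execution leaves the function body without an explicit return value**. With a C return type they return the equivalent of zero instead (for example `0` for `int` and `NULL` for pointer types). This is in contrast to C/C++, which leaves the return value undefined.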

### Automatic Type Conversions

In most situations, automatic conversions will be performed for the basic numeric and string types when a Python object is used in a context requiring a C value, or vice versa. The following table summarises the conversion possibilities.

<table align="left">
<tr><td><b>C types</b></td><td><b>From Python types</b></td><td><b>To Python types</b></td></tr>
<tr><td>`char, short, int, long`</td><td>`int, long`</td><td>`int`</td></tr>
<tr><td>`unsigned int, unsigned long, long long`</td><td>`int, long`</td><td>`long`</td></tr>
<tr><td>`float, double, long double`</td><td>`int, long, float`</td><td>`float`</td></tr>
<tr><td>`char*`</td><td>`str`</td><td>`str`</td></tr>
<tr><td>`struct, union`</td><td></td><td>`dict`</td></tr>
</table>
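One practical consequence of these conversions is that a Python integer is unbounded while every C integer type has a fixed range, so a conversion can only succeed when the value fits. The sketch below computes those ranges with the standard-library `ctypes` module. The helpers `c_int_range` and `fits_in_c_type` are hypothetical names for this illustration and are not part of Cython.

```python
import ctypes

def c_int_range(ctype):
    """Return (lowest, highest) value representable by a ctypes integer type."""
    bits = 8 * ctypes.sizeof(ctype)
    # Storing -1 in a signed type reads back as -1; in an unsigned type it
    # wraps around to the largest value, which lets us tell the two apart.
    if ctype(-1).value < 0:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

def fits_in_c_type(value, ctype):
    """True if the Python int `value` can be stored in `ctype` without loss."""
    low, high = c_int_range(ctype)
    return low <= value <= high

print(c_int_range(ctypes.c_byte))
print(c_int_range(ctypes.c_ubyte))
print(fits_in_c_type(2 ** 40, ctypes.c_short))
print(fits_in_c_type(2 ** 40, ctypes.c_longlong))
```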

### Statements and Expressions

Control structures and expressions follow Python syntax for the most part. When applied to Python objects, they have the same semantics as in Python (unless otherwise noted). Most of the Python operators can also be applied to C values, with the obvious semantics.

If Python objects and C values are mixed in an expression, conversions are performed automatically between Python objects and C numeric or string types.

## Executing Cython Code

### Manual Compilation

Cython code is normally saved in files ending with `.pyx` (the `x` indicates it is different from standard Python code.) A Cython file must be translated to C using the command:

```
cython my_module.pyx
```

This will create a file called `my_module.c` which is the C source for a Python extension module. A useful additional switch is **-a which will generate an HTML document (`my_module.html`) that shows which Cython code translates to which C code line by line.**

Once the C file has been generated, it must be compiled into a shared library. This may vary according to the operating system, but for Linux it would be something like:

```
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 \
        -o my_module.so my_module.c
```

This command will create a library called `my_module.so`. This library can be treated just like any Python module and imported using the normal import statement:

```
import my_module
```
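The include directory and the `.so` file name in the commands above are specific to one installation. As a rough sketch, the standard-library `sysconfig` module can report the values that apply to the interpreter you are actually running; the variable names `include_dir` and `ext_suffix` are chosen here for illustration.

```python
import sysconfig

# Directory containing Python.h, to be passed to the compiler with -I.
include_dir = sysconfig.get_paths()["include"]

# File-name suffix this interpreter expects for extension modules.
# EXT_SUFFIX is the current variable name; SO is a fallback for older Pythons.
ext_suffix = (sysconfig.get_config_var("EXT_SUFFIX")
              or sysconfig.get_config_var("SO"))

print("-I" + include_dir)
print("my_module" + ext_suffix)
```

The printed values can then be substituted into the `gcc` command line in place of the hard-coded path and output name.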

### A Simpler Way

Cython can be used conveniently and interactively from a web browser through the IPython notebook.

To enable support for Cython compilation, install Cython and load the Cython extension from within IPython:

In [2]:
%load_ext Cython

Cython code can now be **compiled using the `%%cython` cell** magic command:

In [3]:
%%cython

def cfunc(int n):
    cdef int a = 0
    for i in range(n):
        a += i
    return a

In [4]:
print cfunc(10)

45


It is also possible to see Cython's code analysis using the `--annotate` option.

In [5]:
%%cython --annotate

def cfunc(int n):
    cdef int a = 0
    for i in range(n):
        a += i
    return a

## A Simple Example

The following pure Python example generates a list of `kmax` prime numbers.

In [6]:
def primes(kmax):
    p = [None] * 1000 # Initialize the list to the max number of elements
    if kmax > 1000:
        kmax = 1000
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

In [7]:
%timeit primes(1000)

10 loops, best of 3: 83.6 ms per loop
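`%timeit` is an IPython magic, so it is not available in a plain Python script. A comparable measurement can be sketched with the standard-library `timeit` module, where "10 loops, best of 3" corresponds to `number=10` and `repeat=3`. The function `count_up` is a hypothetical stand-in workload, not part of the assignment.

```python
import timeit

def count_up(n):
    """Hypothetical stand-in workload: sum the integers below n in a loop."""
    total = 0
    for i in range(n):
        total += i
    return total

# Run the call 10 times per measurement and repeat the measurement 3 times.
loops, repeats = 10, 3
timings = timeit.repeat(lambda: count_up(10000), number=loops, repeat=repeats)

# Report the best (smallest) per-call time, as "best of 3" does.
best = min(timings) / loops
print("%d loops, best of %d: %.3g ms per loop" % (loops, repeats, best * 1e3))
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during a measurement.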


This code can be run without any changes in Cython. The simplest way to do this is by using the `%%cython` cell magic:

In [8]:
%%cython
def cprimes(kmax):
    p = [None] * 1000
    if kmax > 1000:
        kmax = 1000
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

In [9]:
%timeit cprimes(1000)

10 loops, best of 3: 41.1 ms per loop


<div class="alert alert-success">
As you can see, this improved the performance of the pure Python implementation. But can we do better? Let's try adding some types. Uncomment the lines below to see how this affects the execution.
</div>

In [5]:
%%cython
def cprimes2(int kmax):
    cdef int p[1000] 
    if (kmax > 1000): 
        kmax = 1000
    result = []
    cdef int k = 0
    cdef int n = 0
    cdef int i = 0
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i+1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n += 1
    return result
    
        




In [4]:
%timeit cprimes2(1000)

100 loops, best of 3: 2.75 ms per loop


Wow, that made a big difference!
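To put a number on it, the speed-up of each version relative to pure Python can be worked out from the timings reported above. These figures come from one machine and one run, so treat the ratios as indicative only.

```python
# Per-call timings reported above, in milliseconds.
timings_ms = {"primes": 83.6, "cprimes": 41.1, "cprimes2": 2.75}

baseline = timings_ms["primes"]
speedups = {name: baseline / t for name, t in timings_ms.items()}

for name in ("primes", "cprimes", "cprimes2"):
    print("%-9s %5.1fx" % (name, speedups[name]))
```

Compiling the unchanged code gives roughly a 2x improvement, while adding C types brings it to about 30x.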

## Cython For NumPy Users

NumPy can be used from Cython in exactly the same manner as in regular Python. However, Cython also has a number of features that support fast access to NumPy arrays and can result in significant performance gains. In this section, we will look at how some of these features can be used.

The code below does 2D discrete convolution of an image with a filter.

In [12]:
import numpy as np
def naive_convolve(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    h = np.zeros([xmax, ymax], dtype=f.dtype)
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h

Let's get a baseline on how fast this code executes:

In [13]:
%timeit naive_convolve(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))

10000 loops, best of 3: 36.1 µs per loop
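As an independent reference for what `naive_convolve` computes, the uncropped ("full") convolution can be written as $h[x, y] = \sum_{i,j} f[i, j]\, g[x - i, y - j]$. The sketch below implements that definition on nested lists, without NumPy, so it can be used to check small cases by hand. The name `full_convolve` is hypothetical, and the code favors clarity over speed.

```python
def full_convolve(f, g):
    """Full 2D discrete convolution of nested lists f (image) and g (filter)."""
    vmax, wmax = len(f), len(f[0])
    smax, tmax = len(g), len(g[0])
    # The uncropped output grows by (filter size - 1) along each axis.
    h = [[0] * (wmax + tmax - 1) for _ in range(vmax + smax - 1)]
    for i in range(vmax):
        for j in range(wmax):
            for s in range(smax):
                for t in range(tmax):
                    # Each image pixel adds a filter-weighted copy into h.
                    h[i + s][j + t] += f[i][j] * g[s][t]
    return h

# Same inputs as the timing cell above.
print(full_convolve([[1, 1, 1]], [[1], [2], [1]]))
# prints [[1, 1, 1], [2, 2, 2], [1, 1, 1]]
```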


<div class="alert alert-success">
As we saw previously, we can simply compile this code using Cython and expect some performance improvements. Copy the `naive_convolve` function from the cell above, then rename the function `convolve1` and compile it with Cython.
</div>

In [14]:
%%cython
import numpy as np
def convolve1(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    h = np.zeros([xmax, ymax], dtype=f.dtype)
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h

In [15]:
from nose.tools import assert_equal
res1 = naive_convolve(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
res2 = convolve1(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
assert_equal((res1==res2).all(), True)

<div class="alert alert-success">
Now time the new function and see if it is any faster.
</div>

In [16]:
%timeit convolve1(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))

10000 loops, best of 3: 21.5 µs per loop


<div class="alert alert-success">
The next step is to add Cython data types to the code. This code will no longer be compatible with Python, so the consequences of doing this must be carefully considered. The most important change is to use variables that have the same data type as the elements of the NumPy arrays. 
</div><div class="alert alert-success">
The code below will not compile as parts have been commented out. Read each of the commented sections for a description of how the data types are added, then uncomment the line to enable the statement. When you have completed this for the whole function, run the cell to ensure that it compiles correctly. 
</div>

In [17]:
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int  # np.int is an alias for the built-in int type
ctypedef np.int_t DTYPE_t
def convolve2(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    cdef int vmax = f.shape[0]
    cdef int wmax = f.shape[1]
    cdef int smax = g.shape[0]
    cdef int tmax = g.shape[1]
    cdef int smid = smax // 2
    cdef int tmid = tmax // 2
    cdef int xmax = vmax + 2*smid
    cdef int ymax = wmax + 2*tmid
    # Allocate result image.
    cdef np.ndarray h = np.zeros([xmax, ymax], dtype=DTYPE)
    cdef int x, y, s_from, s_to, t_from, t_to, s, t, v, w
    cdef DTYPE_t value
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h

In [18]:
from nose.tools import assert_equal
res1 = naive_convolve(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
res2 = convolve2(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
assert_equal((res1==res2).all(), True)

Now time this code and see if it has improved.

In [19]:
%timeit convolve2(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))

100000 loops, best of 3: 12.8 µs per loop


### Efficient Indexing

This code is still not as efficient as it could be. Array lookups and assignments, like those using the []-operator, still use full Python operations. It would be much more efficient if we could access the data buffer directly at C speed.

It is possible to do this by specifying the type of contents of the `ndarray` objects. We do this with a special “buffer” syntax which must be told the datatype (first argument) and number of dimensions (“ndim” keyword-only argument, if not provided then one-dimensional is assumed).

The changes that need to be made to the previous code are as follows:

<pre>
<code>
...
def convolve2(np.ndarray<b>[DTYPE_t, ndim=2]</b> f, np.ndarray<b>[DTYPE_t, ndim=2]</b> g):
...
cdef np.ndarray<b>[DTYPE_t, ndim=2]</b> h = ...
</code>
</pre>

Now make these changes to the `convolve3` function below and time the result.

In [20]:
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int  # np.int is an alias for the built-in int type
ctypedef np.int_t DTYPE_t
def convolve3(np.ndarray[DTYPE_t, ndim=2] f, np.ndarray[DTYPE_t, ndim=2] g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    cdef int vmax = f.shape[0]
    cdef int wmax = f.shape[1]
    cdef int smax = g.shape[0]
    cdef int tmax = g.shape[1]
    cdef int smid = smax // 2
    cdef int tmid = tmax // 2
    cdef int xmax = vmax + 2*smid
    cdef int ymax = wmax + 2*tmid
    # Allocate result image.
    cdef np.ndarray[DTYPE_t, ndim=2] h = np.zeros([xmax, ymax], dtype=DTYPE)
    cdef int x, y, s_from, s_to, t_from, t_to, s, t, v, w
    cdef DTYPE_t value
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]  # index as g[i, j]; chained g[i][j] builds an intermediate array object and bypasses the fast buffer access
            h[x,y] = value
    return h

In [21]:
from nose.tools import assert_equal, assert_less
import timeit
def run1():
    return convolve2(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
def run2():
    return convolve3(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
res1 = run1()
res2 = run2()
t1 = timeit.timeit(run1, number=10000)
t2 = timeit.timeit(run2, number=10000)
assert_less(t2, t1)
assert_equal((res1==res2).all(), True)
print "convolve3 is faster than convolve2 (%f < %f)!" % (t2, t1)

convolve3 is faster than convolve2 (0.097054 < 0.146748)!
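The gain from the buffer syntax comes from knowing the element type and the number of dimensions ahead of time, which turns an element lookup into simple offset arithmetic on the underlying memory. The following sketch illustrates that arithmetic using only the standard library (`array`, `memoryview` and `struct`) rather than NumPy or Cython. The helper `element_at` is a hypothetical name, and tuple indexing on a `memoryview` assumes Python 3.5 or later.

```python
import array
import struct

# Six C ints stored contiguously, then viewed as a 2 x 3 row-major array.
data = array.array("i", [0, 1, 2, 3, 4, 5])
raw = memoryview(data).cast("B")        # the underlying bytes
view = raw.cast("i", shape=[2, 3])      # typed, two-dimensional view

def element_at(i, j):
    """Read element (i, j) directly from the raw bytes using the strides."""
    offset = i * view.strides[0] + j * view.strides[1]
    return struct.unpack_from(view.format, raw, offset)[0]

print(view.format, view.ndim, view.shape)
print(element_at(1, 2), view[1, 2])
```

With the item type and `ndim` fixed, the offset computation needs no Python objects at all, which is the kind of work a C compiler can do very quickly.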


### More Indexing Improvements

The NumPy array lookups are still slowed down by two other factors:

1. Bounds checking is performed.

2. Negative indices are checked for and handled correctly. 

If we are certain that the code will always access within the array bounds, and that it doesn’t use negative indices, then it is possible to gain some extra performance by avoiding these checks.

<div class="alert alert-danger">
Note however that this comes at the cost of safety. If the code does not behave exactly as you expect, it could crash the program or corrupt data.
</div>

Bounds checking can be disabled by adding a decorator to the function as follows:

```cython
...
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def convolve3(np.ndarray[DTYPE_t, ndim=2] f, np.ndarray[DTYPE_t, ndim=2] g):
...
```

Now bounds checking is not performed. Bounds checking can be switched on and off in several ways; see [Compiler directives](http://docs.cython.org/src/reference/compilation.html#compiler-directives) for more information.

Negative indices are dealt with by forcing the indexes to be positive using the unsigned integer type for the index variables and casting values to this type. If negative values are cast, then this will create a very large positive value instead and it may result in an attempt to access out-of-bounds values. Casting is done with a special <>-syntax. The code below shows how to change the function to use either unsigned ints or casting as appropriate:

```cython
...
cdef int s, t
cdef unsigned int x, y, v, w
...
               v = <unsigned int>(x - smid + s)
               w = <unsigned int>(y - tmid + t)
               value += g[<unsigned int>(smid - s), <unsigned int>(tmid - t)] * f[v, w]
...
```
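To see why a negative value becomes "a very large positive value" after the cast, the conversion can be reproduced in plain Python. This sketch uses the standard-library `ctypes` module, which stores values without overflow checking; the helper name `as_unsigned_int` is made up for this illustration.

```python
import ctypes

def as_unsigned_int(value):
    """Reinterpret a Python int the way a C cast to unsigned int would."""
    return ctypes.c_uint(value).value

bits = 8 * ctypes.sizeof(ctypes.c_uint)
print(as_unsigned_int(3))
print(as_unsigned_int(-1) == 2 ** bits - 1)

# By default, a negative index counts from the end of the sequence.
# This is the behavior that is given up when indices are made unsigned.
row = [10, 20, 30]
print(row[-1] == row[len(row) - 1])
```

With bounds checking also disabled, such a wrapped index would read or write far outside the array, which is why the warning above matters.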

<div class="alert alert-success">
Make the changes for the bounds checking and negative indices to the following code and compare how it performs with the other versions.
</div>

In [22]:
%%cython
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def convolve4(np.ndarray[DTYPE_t, ndim=2] f, np.ndarray[DTYPE_t, ndim=2] g):
    cdef unsigned short vmax, wmax, smax, tmax, xmax, ymax
    cdef short smid, tmid
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    cdef np.ndarray[DTYPE_t, ndim=2] h = np.zeros([xmax, ymax], dtype=f.dtype)
    cdef unsigned short x, y
    cdef short s_from, s_to, t_from, t_to, s, t, v, w
    cdef DTYPE_t value
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]  # index as g[i, j]; chained g[i][j] builds an intermediate array object and bypasses the fast buffer access
            h[x,y] = value
    return h

In [23]:
from nose.tools import assert_equal, assert_less
import timeit
def run1():
    return convolve2(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))
def run2():
    return convolve4(np.array([[1, 1, 1]], dtype=np.int), np.array([[1],[2],[1]], dtype=np.int))

res1 = run1()
res2 = run2()

t1 = timeit.timeit(run1, number=10000)
t2 = timeit.timeit(run2, number=10000)
assert_less(t2, t1)
assert_equal((res1==res2).all(), True)
print "convolve4 is faster than convolve2 (%f < %f)!" % (t2, t1)

convolve4 is faster than convolve2 (0.085978 < 0.134839)!


## Conclusion

To summarize what we learned:
* A new syntax for specifying C-like data types
* Running pure Python code with Cython improves performance
* Modifying Python code to add type information helps Cython optimize the code further
* Cython provides compile-time information that helps speed up NumPy programs
* Using type information when manipulating NumPy arrays boosts performance
* Removing safety checks on array bounds and negative indexes can speed performance at the expense of safety