# DS-GA-3001 Advanced Python for Data Science
## Lab 7 – Introduction to Cython

Based on 2016 lecture for this class, prepared by Greg Watson and revised by Maria Elena Villalobos.
Most of the content was extracted from the official [Cython documentation and tutorials](http://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html). 

Before you turn this problem in, make sure you **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart). You can then run the cells **in order**, during the class.

Any textual answers that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any code answers that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

---

# An Introduction to Cython

The fundamental nature of Cython can be summed up as follows: **Cython is Python with C data types**.

## Who uses Cython?

Cython is particularly popular among scientific users of Python, where it has "the perfect audience" according to Python developer Guido van Rossum. Of particular note:

- The free software SageMath computer algebra system depends on Cython, both for performance and to interface with other libraries.
- Significant parts of the scientific computing libraries **SciPy, pandas and scikit-learn** are written in Cython.
- Some high-traffic websites such as **Quora** use Cython.

Cython's domain is not limited to just numerical computing. For example, the lxml XML toolkit is written mostly in Cython, and like its predecessor Pyrex, Cython is used to provide Python bindings for many C and C++ libraries like the messaging library ZeroMQ. Cython can also be used to develop parallel programs for multi-core processor machines; this feature makes use of the OpenMP library. (from [Wikipedia](https://en.wikipedia.org/wiki/Cython))

## How does it work?
Cython is a modification of Python that adds C data types. Almost any piece of Python code is also valid Cython code (with a few [limitations](http://docs.cython.org/src/userguide/limitations.html#cython-limitations)). The Cython compiler then converts the (modified) Python code into C code which makes equivalent calls to the Python/C API. This C code is then compiled into a shared library which can be imported into Python.

In Cython, function parameters and variables can be declared to have C data types, and code which manipulates Python values and C values can be freely intermixed. Cython takes care of converting from C to Python data types automatically wherever possible. Reference count maintenance and error checking of Python operations is also automatic, and the full power of Python’s exception handling facilities, including the try-except and try-finally statements, is still available.

## Benefits of Cython
There are two main benefits of Cython:

1. **Speed.** How much performance improves depends very much on the program. Typical numerical Python programs tend to gain very little, because most of their time is already spent in lower-level C code. However, programs dominated by explicit `for` loops can improve by many orders of magnitude once type information is added.
2. **Easy calling into C code.** One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.
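
To see why the gain depends so much on the program, the sketch below times the same sum two ways in plain Python: once with an explicit loop, and once with the built-in `sum`, whose loop already runs in C. The function names and the problem size are illustrative assumptions, and the timings will differ from machine to machine. The first style is the kind of code where adding C types tends to pay off.

```python
import timeit

def loop_sum(n):
    """Sum 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def builtin_sum(n):
    """Same result, but the loop runs inside the C implementation of sum()."""
    return sum(range(n))

n = 100_000
t_loop = timeit.timeit(lambda: loop_sum(n), number=20)
t_builtin = timeit.timeit(lambda: builtin_sum(n), number=20)
print(f"explicit loop: {t_loop:.4f} s, built-in sum: {t_builtin:.4f} s")
```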

The following sections provide a very brief introduction to Cython. See [Cython Language Basics](http://docs.cython.org/src/userguide/language_basics.html#language-basics) for a more detailed description of the Cython language.

## Cython Syntax

### Basic C Types

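Cython supports most C data types. The following table lists the most common types. The sizes shown are typical for 64-bit Linux; C only guarantees minimum sizes, so `long` (32 bits on 64-bit Windows) and `long double` in particular vary by platform.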

<p>
<table align="left">
<tr><td align="center"><b>Type</b></td><td align="center"><b>Description</b></td></tr>
<tr><td>`char`</td><td>8 bit integer</td></tr>
<tr><td>`short`</td><td>16 bit integer</td></tr>
<tr><td>`int`</td><td>32 bit integer</td></tr>
<tr><td>`long`</td><td>64 bit integer</td></tr>
<tr><td>`long long`</td><td>64 bit integer</td></tr>
<tr><td>`float`</td><td>32 bit floating point</td></tr>
<tr><td>`double`</td><td>64 bit floating point</td></tr>
<tr><td>`long double`</td><td>80 bit floating point</td></tr>
<tr><td>`float a[10][30]`</td><td>2-dimensional array</td></tr>
<tr><td>`char *s`</td><td>pointer</td></tr>
<tr><td>`struct foo`</td><td>structure</td></tr>
<tr><td>`union bar`</td><td>union</td></tr>
<tr><td>`enum type`</td><td>enumeration</td></tr>
</table>
</p>
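
Because these sizes depend on the platform and compiler, it can help to check them on your own machine. The following is a minimal sketch that uses the standard-library `ctypes` module (not Cython itself) to report the storage size of each type. The dictionary name `c_types` is just an illustrative choice.

```python
import ctypes

# Map the C type names from the table to their ctypes equivalents.
c_types = {
    "char": ctypes.c_char,
    "short": ctypes.c_short,
    "int": ctypes.c_int,
    "long": ctypes.c_long,
    "long long": ctypes.c_longlong,
    "float": ctypes.c_float,
    "double": ctypes.c_double,
    "long double": ctypes.c_longdouble,
}

# sizeof() returns bytes; convert to bits for comparison with the table.
sizes_in_bits = {name: 8 * ctypes.sizeof(t) for name, t in c_types.items()}
for name, bits in sizes_in_bits.items():
    print(f"{name:12s} {bits} bits")
```

Note that `sizeof` reports storage size including padding, so `long double` often shows 128 bits on x86-64 Linux even though only 80 bits carry the value.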

### Variable and Type Definitions

The `cdef` statement is used to declare C variables, either local or module-level:

```cython
cdef int i, j, k
cdef float f, g[42], *h
```

In C, types can be given names using the `typedef` statement. The equivalent in Cython is `ctypedef`:

```cython
ctypedef int * intPtr
```

Cython also supports C `struct`, `union`, or `enum` types:

<p>
<table align="left">
<tr><td align="center"><b>C code</b></td><td align="center"><b>Cython code</b></td></tr>
<tr><td>
```
struct Grail {
    int age;
    float volume;
};
```
</td><td>
```cython
cdef struct Grail:
    int age
    float volume
```
</td></tr>
<tr><td>
```
union Food {
    char *spam;
    float *eggs;
};
```
</td><td>
```
cdef union Food:
    char *spam
    float *eggs
```
</td></tr>
<tr><td>
```
enum CheeseType {
    cheddar, edam,
    camembert
};
```
</td><td>
```
cdef enum CheeseType:
    cheddar, edam,
    camembert
```
</td></tr>
<tr><td>
```
enum CheeseState {
    hard = 1,
    soft = 2,
    runny = 3
};
```
</td><td>
```
cdef enum CheeseState:
    hard = 1
    soft = 2
    runny = 3
```
</td></tr>
</table>
</p>
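
For comparison, the same three kinds of declaration can be mirrored in plain Python with the standard-library `ctypes` and `enum` modules. This is only an analogy to make the memory layout tangible, not how Cython implements them. The class names reuse those from the table, and the field values are made up for illustration.

```python
import ctypes
from enum import IntEnum

class Grail(ctypes.Structure):
    # Fields are laid out one after another, as in a C struct.
    _fields_ = [("age", ctypes.c_int), ("volume", ctypes.c_float)]

class Food(ctypes.Union):
    # Both members share the same storage, as in a C union.
    _fields_ = [("spam", ctypes.c_char_p),
                ("eggs", ctypes.POINTER(ctypes.c_float))]

class CheeseState(IntEnum):
    # Named integer constants, as in a C enum with explicit values.
    hard = 1
    soft = 2
    runny = 3

grail = Grail(age=3, volume=1.5)
print(grail.age, grail.volume)
# A union of two pointers occupies the space of a single pointer.
print(ctypes.sizeof(Food) == ctypes.sizeof(ctypes.c_void_p))
print(int(CheeseState.soft))
```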


### Functions

There are two kinds of function definition in Cython:

* **Python functions** are defined using the `def` statement, as in Python. They take Python objects as parameters and return Python objects.

* **C functions** are defined using the new `cdef` statement. They take either Python objects or C values as parameters, and can return either Python objects or C values.

Within a Cython module, Python functions and C functions can call each other freely, but only Python functions can be called from outside the module by interpreted Python code. So, any functions that you want to “export” from your Cython module must be declared as Python functions using `def`.

There is also a hybrid kind of function, declared with `cpdef`. A `cpdef` function can be called from anywhere, but uses the faster C calling conventions when called from other Cython code. A `cpdef` method can also be overridden by a Python method on a subclass or by an instance attribute, even when called from Cython. If this happens, most of the performance gain is lost; even if it does not, calling a `cpdef` method from Cython carries a tiny overhead compared to calling a `cdef` method.

Parameters of either type of function can be declared to have C data types, using normal C declaration syntax. For example:

```cython
def spam(int i, char *s):
    ...

cdef int eggs(unsigned long l, float f):
    ...
```

Automatic conversion is currently only possible for numeric types, string types and structs (composed recursively of any of these types); attempting to use any other type for the parameter of a Python function will result in a compile-time error. Care must be taken with strings to ensure a reference if the pointer is to be used after the call. Structs can be obtained from Python mappings, and again care must be taken with string attributes if they are to be used after the function returns.

C functions, on the other hand, can have parameters of any type, since they’re passed in directly using a normal C function call.

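Functions declared using `cdef` with a Python object return type, like Python functions, return `None` when execution leaves the function body without an explicit return value. With a C return type, the equivalent of zero is returned instead (for example `0` for `int`, or `NULL` for a pointer). This is in contrast to C/C++, which leaves the return value undefined.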

### Automatic Type Conversions

In most situations, automatic conversions will be performed for the basic numeric and string types when a Python object is used in a context requiring a C value, or vice versa. The following table summarises the conversion possibilities.

<table align="left">
<tr><td><b>C types</b></td><td><b>From Python types</b></td><td><b>To Python types</b></td></tr>
<tr><td>`char, short, int, long`</td><td>`int` (also `long` in Python 2)</td><td>`int`</td></tr>
<tr><td>`unsigned int, unsigned long, long long`</td><td>`int` (also `long` in Python 2)</td><td>`int` (`long` in Python 2)</td></tr>
<tr><td>`float, double, long double`</td><td>`int, float` (also `long` in Python 2)</td><td>`float`</td></tr>
<tr><td>`char*`</td><td>`bytes` (`str` in Python 2)</td><td>`bytes` (`str` in Python 2)</td></tr>
<tr><td>`struct, union`</td><td></td><td>`dict`</td></tr>
</table>
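
Two practical consequences of these conversions can be sketched in plain Python. First, a Python `int` has arbitrary precision while a C `int` does not, so a conversion can fail; Cython checks the range and raises `OverflowError` when the value does not fit. Second, under Python 3 a `char*` corresponds to `bytes`, so text has to be encoded first. The helper `fits_c_int` below is a hypothetical name used only for this illustration.

```python
import ctypes

# Range of a signed C int on this platform (usually 32 bits).
INT_BITS = 8 * ctypes.sizeof(ctypes.c_int)
INT_MIN = -(2 ** (INT_BITS - 1))
INT_MAX = 2 ** (INT_BITS - 1) - 1

def fits_c_int(value):
    """Return True if a Python int can be stored in a C int without overflow."""
    return INT_MIN <= value <= INT_MAX

print(fits_c_int(10))       # small values convert cleanly
print(fits_c_int(2 ** 40))  # False when int is 32 bits wide

# Under Python 3, text must be encoded to bytes before it can become a char*.
text = "spam"
as_bytes = text.encode("utf-8")
print(as_bytes, as_bytes.decode("utf-8") == text)
```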

### Statements and Expressions

Control structures and expressions follow Python syntax for the most part. When applied to Python objects, they have the same semantics as in Python (unless otherwise noted). Most of the Python operators can also be applied to C values, with the obvious semantics.

If Python objects and C values are mixed in an expression, conversions are performed automatically between Python objects and C numeric or string types.

## Executing Cython Code

### Manual Compilation

Cython code is normally saved in files ending with `.pyx` (the `x` indicates it is different from standard Python code.) A Cython file must be translated to C using the command:

```
cython my_module.pyx
```

This will create a file called `my_module.c`, which is the C source for a Python extension module. A useful additional switch is `-a`, which generates an HTML document (`my_module.html`) that shows, line by line, which Cython code translates to which C code.

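Once the C file has been generated, it must be compiled into a shared library. The exact command varies with the operating system and the Python installation (the `-I` path must point to the headers of the interpreter you will import the module from), but for Linux it would be something like: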

```
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 \
        -o my_module.so my_module.c
```

This command will create a library called `my_module.so`. This library can be treated just like any Python module and imported using the normal import statement:

```
import my_module
```
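
The include path and library file name in the `gcc` command above are specific to one installation. As a rough sketch, the standard-library `sysconfig` module can report the values for the interpreter you are actually running. The compiler flags are copied from the command above, and the variable names are illustrative assumptions.

```python
import sysconfig

# Directory containing Python.h for the running interpreter.
include_dir = sysconfig.get_paths()["include"]
# Platform-specific suffix for extension modules (it ends in ".so" on Linux).
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

module = "my_module"
command = (
    "gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing "
    f"-I{include_dir} -o {module}{ext_suffix} {module}.c"
)
print(command)
```

The printed command is only a starting point; a path containing spaces would need quoting, and other platforms may need different flags.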

### A Simpler Way

Cython can be used conveniently and interactively from a web browser through the IPython notebook.

To enable support for Cython compilation, install Cython and load the Cython extension from within IPython:

In [1]:
%load_ext Cython

Cython code can now be compiled using the `%%cython` cell magic command:

In [None]:
%%cython

def cfunc(int n):
    cdef int a = 0
    for i in range(n):
        a += i
    return a

In [None]:
print(cfunc(10))

It is also possible to see Cython's code analysis using the `--annotate` option.

In [None]:
%%cython --annotate

def cfunc(int n):
    cdef int a = 0
    for i in range(n):
        a += i
    return a

## A Simple Example

The following pure Python example returns a list of the first `kmax` prime numbers (at most 1000).

In [1]:
def primes(kmax):
    p = [None] * 1000 # Initialize the list to the max number of elements
    if kmax > 1000:
        kmax = 1000
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

Let's time it to see how long it takes to generate 1000 primes.

In [2]:
%timeit primes(1000)

1 loop, best of 3: 445 ms per loop


This code can be run without any changes in Cython. The simplest way to do this is by using the `%%cython` cell magic:

In [12]:
%%cython
def primes(kmax):
    p = [None] * 1000 # Initialize the list to the max number of elements
    if kmax > 1000:
        kmax = 1000
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

Now let's see if there was any improvement.

In [13]:
%timeit primes(1000)

1 loop, best of 3: 322 ms per loop


As you can see, this improved the performance of the pure Python implementation. But can we do better? Let's try adding some types using the Cython `cdef` statement.

<div class="alert alert-success">
Make a copy of the Cython version of primes from the cell above, then declare the `i`, `k`, and `n` variables as type `int`. Replace the `p = [None] * 1000` line with the declaration `cdef int p[1000]`.
</div>

In [17]:
%%cython
def primes(int kmax):
    cdef int p[1000]  # C array holding the primes found so far
    cdef int k = 0
    cdef int n = 2
    cdef int i = 0
    if kmax > 1000:
        kmax = 1000
    result = []  # stays a Python list so it can be appended to and returned
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

In [15]:
%timeit primes(1000)

10 loops, best of 3: 35.9 ms per loop


Wow, that made a big difference!
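
When optimizing, it is worth confirming that the faster version still returns the same answers. The sketch below is one way to do that in plain Python. `primes_reference` and `check_primes` are hypothetical helper names introduced here, not part of the lab, and the reference uses simple trial division against the primes already found.

```python
def primes_reference(kmax):
    """Return the first kmax primes (capped at 1000) by trial division."""
    kmax = min(kmax, 1000)
    found = []
    n = 2
    while len(found) < kmax:
        # n is prime if no previously found prime divides it.
        if all(n % p != 0 for p in found):
            found.append(n)
        n += 1
    return found

def check_primes(implementation, sizes=(0, 1, 10, 100)):
    """Compare an implementation against the reference for a few sizes."""
    return all(list(implementation(k)) == primes_reference(k) for k in sizes)

print(primes_reference(10))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(check_primes(primes_reference))
```

In the notebook, calling `check_primes(primes)` after each `%%cython` cell would compare the currently defined `primes` against the reference.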