
![Cython logo](https://upload.wikimedia.org/wikipedia/en/thumb/c/ce/Cython-logo.svg/1200px-Cython-logo.svg.png =200x100)

# Getting started with Cython




# Introduction

As a big data and machine learning engineer at a multinational reinsurance firm, I frequently build data pipelines, and these pipelines can be slow when written in standard Python.  To get around bottlenecks in certain memory-intensive tasks, I use Cython.
<br>The __Cython__ language is a superset of __Python__, which means that almost all Python code also works in Cython.  The reason for using Cython is that Python code can sometimes be slow, and by converting the slow parts into Cython we can dramatically reduce the run time.
As always, it's best to begin with an example of the speed benefits of Cython.

To begin, we must load the Cython extension when working inside a Jupyter notebook.

In [0]:
%load_ext cython

### Cython vs. Python Speed
As a first example, let's create two functions that compute the sum of the integers from 0 to n-1<br>
*Example 1: Compute*<br>
## $$s = \sum_{i=0}^{n-1} i$$

- *__NOTE__: we must add __%%cython__ at the top of a cell to compile and run it as Cython code.*
- *Cython C level (__cdef__) functions must be wrapped in a Python function to use them in standard Python.*

In [0]:
%%cython

# - python version
def py_sum(n):
  """Compute the sum"""
  i = 0
  the_sum = 0
  for i in range(n):
    the_sum += i
  return the_sum

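# - cython version
# the sum for n = 100000000 does not fit in a 32-bit C int,
# so the accumulator and the return type are long long
cdef inline long long cy_sum(int n):
  cdef int i = 0
  cdef long long the_sum = 0
  for i in range(n):
    the_sum += i
  return the_sum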

def cy_wrapper(int n):
  return cy_sum(n)
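
A quick aside on integer width: the benchmark below calls the functions with n = 100000000, and the resulting sum is larger than a 32-bit C `int` can hold. The following is a minimal sketch in plain Python, using the standard `ctypes` module purely to illustrate what truncation to a C `int` looks like. It assumes a platform where `int` is 32 bits wide, and it is not part of the original notebook.

```python
import ctypes

n = 100_000_000

# closed form for 0 + 1 + ... + (n - 1)
true_sum = n * (n - 1) // 2

# ctypes integer types do no overflow checking: the value is simply
# truncated to the width of the C type
wrapped = ctypes.c_int(true_sum).value
as_long_long = ctypes.c_longlong(true_sum).value

print(true_sum)       # → 4999999950000000
print(wrapped)        # a truncated value, different from the true sum
print(as_long_long)   # same as true_sum, since long long is wide enough
```

This is why a wider type such as `long long` is the safer choice for the accumulator when n gets large.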


### Run Time Comparison
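As you can see, Cython is clearly the winner in terms of run time, with numpy being the slowest. The numpy call is slow here because __np.sum__ is handed a Python __range__ and must first convert it into an array; summing an array created with __np.arange__ avoids that conversion. Note also that __py_sum__ is itself compiled by Cython because it lives in the %%cython cell, but without type declarations it still operates on Python objects.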

In [4]:
import numpy as np

print("python time complexity")
%timeit py_sum(100000000)
print('\n')

print('cython time complexity')
%timeit cy_wrapper(100000000)
print('\n')

print("python sum function")
%timeit sum(range(1,100000000))
print('\n')

print("numpy sum")
%timeit np.sum(range(1, 100000000))

python time complexity
1 loop, best of 3: 4.44 s per loop


cython time complexity
10 loops, best of 3: 36.8 ms per loop


python sum function
1 loop, best of 3: 2.11 s per loop


numpy sum
1 loop, best of 3: 21.5 s per loop
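
Outside a notebook the `%timeit` magic is not available. As a rough sketch, the same kind of comparison can be made with the standard `timeit` module. The value of `n` and the repeat count below are my own assumptions, chosen so the script finishes quickly, and only the pure Python variants are timed because the Cython functions need a compiled module.

```python
import timeit

def py_sum(n):
    """Sum the integers 0 .. n-1 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total

n = 100_000   # far smaller than the notebook's 100000000
repeats = 5   # number of calls per measurement

loop_time = timeit.timeit(lambda: py_sum(n), number=repeats)
builtin_time = timeit.timeit(lambda: sum(range(n)), number=repeats)

print(f"explicit loop: {loop_time:.4f} s for {repeats} calls")
print(f"built-in sum : {builtin_time:.4f} s for {repeats} calls")
```

The absolute numbers depend on the machine, so treat them as relative indicators only.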


# Cython Function Types
In cython, we can declare three types of functions:
- __cdef__: *C level function. cdef functions cannot be called directly in pure python.*
- __cpdef__: *C level function with a python binding so it can be called in pure python.*
- __def__: *Pure python function.*

Within Cython, C-typed variables are declared with the __*cdef*__ keyword, for example:
```python
cdef int i            # declare an integer variable
cdef double d = 10.1  # declare a double and initialize it to 10.1
```

# Cython DataTypes
Unlike Python, where the type of a variable is determined at run time by the interpreter, C-level variables in Cython must be declared with a static type.  The next cell demonstrates some of the data types available in Cython.


In [0]:
%%cython
# add -a (i.e. %%cython -a) to get an annotated view of where the code interacts with Python


# - declaration of basic C variable types
cdef:
  int i = 0       # integer 
  bint b = True   # boolean
  char c = b'w'   # character
  double d = 10.1 # double
  float f = 1.10  # floating point
  long l = 1000   # long int
  long double ld = 100000000000.10 # long double
  
# - declaration of python types within C
#    these python objects are declared by cython
#    as C pointers to built-in Python struct type
cdef list cy_list
cdef dict cy_dict
cdef str cy_str
cdef set cy_set
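
The C types above have fixed sizes that depend on the platform and compiler. As a small sketch, the standard `ctypes` module can report those sizes from plain Python. The dictionary of type names is my own choice for illustration, and `bint` is left out because it has no direct `ctypes` counterpart.

```python
import ctypes

# map C type names to their ctypes equivalents
c_types = {
    "char": ctypes.c_char,
    "int": ctypes.c_int,
    "long": ctypes.c_long,
    "float": ctypes.c_float,
    "double": ctypes.c_double,
    "long double": ctypes.c_longdouble,
}

# size of each type in bytes on the current platform
sizes = {name: ctypes.sizeof(ctype) for name, ctype in c_types.items()}

for name, size in sizes.items():
    print(f"{name:12s} {size} bytes")
```

The sizes of `long` and `long double` in particular differ between platforms, so the printed values are not universal.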

# Cython Pointers
Languages like __C__,  __C++__ , __Rust__, and __Go__ provide variable types called *pointers*.
A pointer is a variable that stores the memory address of another variable. In __C__/__C++__ we declare them like:
```C
#include <stdio.h>

int main(void) {
    int a = 10;
    // declaration of a pointer to an int
    int *aPtr = NULL;
    aPtr = &a; // store the address of a

    printf("%d\n", *aPtr);        // prints out the value 10
    printf("%p\n", (void *)aPtr); // prints the address, e.g. 0x7ffee90ef768

    // dereference operator
    // this changes the value stored at that address
    *aPtr = 100;
    printf("%d\n", a); // a is now 100

    return 0;
}
```
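
For readers who are more at home in Python, here is a rough equivalent of the C snippet above using the standard `ctypes` module. It is only a sketch to show the idea of an address and a dereference; the variable names mirror the C code.

```python
import ctypes

a = ctypes.c_int(10)

# like: int *aPtr = &a;
a_ptr = ctypes.pointer(a)

print(a_ptr.contents.value)      # → 10
print(hex(ctypes.addressof(a)))  # the address of a; differs on every run

# like: *aPtr = 100;
a_ptr.contents.value = 100
print(a.value)                   # → 100
```

Writing through the pointer changes `a` itself, because both names refer to the same memory.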

### For memory allocation, C uses the malloc, realloc, and free functions from the stdlib.h library.<br> Memory is taken from the *free store (heap)*.
```c
 #include <stdlib.h> // malloc, realloc, free
 #include <stdio.h>  // printf

 int main(int argc, char *argv[]) {
     // declare a pointer to a char
     char *token = NULL; // good practice to set to NULL

     // initial memory allocation
     token = (char*)malloc(5*sizeof(char)); // allocate 5 bytes of memory
     if (token == NULL) { return 1; }       // allocation failed

     // fill the 5 bytes with chars
     int i = 0;
     *(token + i) = 'h';
     *(token + (i+1)) = 'e';
     *(token + (i+2)) = 'l';
     *(token + (i+3)) = 'l';
     *(token + (i+4)) = 'o';       //  0 1 2 3 4  memory position
     // token holds the following: // [h|e|l|l|o]
     // NOTE: *(token + i) is equivalent to token[i]. No '\0' terminator
     // is stored, so token is not a valid C string.

     // If we want to store hello world into token, we use the realloc function
     char *tmp = (char*)realloc(token, 11*sizeof(char));
     if (tmp == NULL) { free(token); return 1; } // keep the old block on failure
     token = tmp;
     *(token + (i+5)) = ' ';
     *(token + (i+6)) = 'w';
     *(token + (i+7)) = 'o';
     *(token + (i+8)) = 'r';
     *(token + (i+9)) = 'l';
     *(token + (i+10)) = 'd';

     printf("%.*s\n", 11, token); // print exactly 11 chars: hello world

     free(token); // release the memory back to the heap
     return 0;
 }

```


## Pointers and Memory allocation in Cython
We can do the same types of memory allocation within cython.

In [6]:
%%cython
cimport cython
from libc.stdlib cimport malloc, realloc, free # import C level functions for memory allocation

# - declare a pointer to a char
cdef char* token = NULL; # good practice to initialize pointers to NULL
# - allocate 5 bytes of memory to token
token = <char*>malloc(5*sizeof(char))

# - insert values into the token array
token[0] = b'h'
token[1] = b'e'
token[2] = b'l'
token[3] = b'l'
token[4] = b'o'
cdef int i = 0
print(chr(token[i]))
print(chr(token[i+1]))
print(chr(token[i+2]))
print(chr(token[i+3]))
print(chr(token[i+4]))
print('\n\n')
# - !WARNING: you must be careful with pointers as you can access memory 
# that may be storing other functions or variables and you can accidentally 
# overwrite them. Use caution when using pointers.

#print(chr(token[i+5])) # we access a memory location not assigned to token

# - if we need more memory, we can use realloc just like the C code above
token = <char*>realloc(token, 11*sizeof(char))
token[5] = b' '
token[6] = b'w'
token[7] = b'o'
token[8] = b'r'
token[9] = b'l'
token[10] = b'd'
print(chr(token[i]))
print(chr(token[i+1]))
print(chr(token[i+2]))
print(chr(token[i+3]))
print(chr(token[i+4]))
print(chr(token[i+5]))
print(chr(token[i+6]))
print(chr(token[i+7]))
print(chr(token[i+8]))
print(chr(token[i+9]))
print(chr(token[i+10]))

# - we use C's free function to release the memory taken from the free store (heap)
free(token) # release memory back to the heap
#print(token) # !WARNING: using token after free is undefined behavior.

h
e
l
l
o



h
e
l
l
o
 
w
o
r
l
d


# C Data Containers
Cython has access to C's __*struct*__, __*enum*__, and __*union*__  data containers. In C/C++, these can be created by:

```c

/*Ex: struct*/
// structs allow the user to combine data of different types.
// example of a struct which stores character tokens
struct __tok__ {
  
  char **tokens;       // store the tokens
  int *token_lengths;  // length of each token

} token; // declares a variable named token of type struct __tok__

// create a pointer to the __tok__ struct
struct __tok__ *tPtr = (struct __tok__*)malloc(sizeof(struct __tok__));

/*access tPtr's data*/
// allocate memory for the token lengths
// Assume the input sentence has 10 words
tPtr->token_lengths = (int*)malloc(10*sizeof(int));

// allocate memory for 20 tokens
// **tokens is a pointer to pointer to char
tPtr->tokens = (char**)malloc(20*sizeof(char *));
// each element in the tokens array is a pointer
// so we must allocate memory for each pointer now
for (int i=0; i < 20; i++) {
// assume each token has length of 10 chars
  *(tPtr->tokens + i) = (char*)malloc(10*sizeof(char));
}


/*Ex enum*/
// used to assign names to integral constants
enum week_days {Sunday, 
Monday, Tuesday, Wednesday, 
Thursday, Friday, Saturday};

enum week_days day;
day = Monday;
printf("%d\n", day);  // prints 1

/* Ex union*/
union data {
    int i;
    float f;
    char str[20];

} d1, d2, *d3;


```
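
To get a feel for these containers without compiling anything, the sketch below mirrors them in plain Python with the standard `ctypes` and `enum` modules. The class names `Tok`, `WeekDays`, and `Data` and the sample words are my own, chosen to echo the C code above. It assumes a 32-bit `int` and IEEE-754 floats, which holds on mainstream platforms.

```python
import ctypes
from enum import IntEnum

# enum: names for integral constants, starting at 0 like the C version
class WeekDays(IntEnum):
    Sunday = 0
    Monday = 1
    Tuesday = 2
    Wednesday = 3
    Thursday = 4
    Friday = 5
    Saturday = 6

# union: all members share the same memory
class Data(ctypes.Union):
    _fields_ = [("i", ctypes.c_int),
                ("f", ctypes.c_float),
                ("str", ctypes.c_char * 20)]

# struct: members of different types stored side by side
class Tok(ctypes.Structure):
    _fields_ = [("tokens", ctypes.POINTER(ctypes.c_char_p)),
                ("token_lengths", ctypes.POINTER(ctypes.c_int))]

print(int(WeekDays.Monday))   # → 1

d = Data()
d.f = 1.0
print(hex(d.i))               # → 0x3f800000, the bit pattern of 1.0f
print(ctypes.sizeof(Data))    # → 20, the size of the largest member

words = [b"hello", b"world"]
tok = Tok()
tok.tokens = (ctypes.c_char_p * len(words))(*words)
tok.token_lengths = (ctypes.c_int * len(words))(*[len(w) for w in words])
print(tok.tokens[1], tok.token_lengths[1])   # → b'world' 5
```

Reading `d.i` after writing `d.f` shows the key property of a union: the members are different views of the same bytes.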

### C Data Containers in Cython

In [7]:
%%cython
import warnings
from libc.stdlib cimport malloc, realloc, free
warnings.filterwarnings(action='once')
# - create a struct called __tok__
cdef struct __tok__:
  char **tokens
  int *num_tokens
  
# - create an enum called week_days
cdef enum week_days:
  Sunday, Monday, Tuesday,
  Wednesday, Thursday,
  Friday, Saturday
  
# - create a union called data
cdef union data:
  double *d_data_array
  int *i_data_array
  char *c_data_array
  float *f_data_array
  
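# - create a struct pointer to __tok__ and allocate memory
from libc.string cimport strcpy
cdef __tok__ *tPtr = <__tok__*>malloc(sizeof(__tok__))
tPtr.tokens = <char**>malloc(5*sizeof(char*)) # allocate memory for 5 tokens

# - iterate through each token container and allocate memory
cdef int j = 0
for j in range(5):
  # 5 chars per token plus 1 byte for the terminating '\0'
  tPtr.tokens[j] = <char*>malloc(6*sizeof(char))
  
# - load tokens into tPtr.tokens
b1 = b'hello'
b2 = b'world'
b3 = b'there'
b4 = b'other'
b5 = b'nicer'

# - copy the bytes into the allocated buffers. Assigning tPtr.tokens[0] = b1
#   would overwrite the pointer, leak the malloc'd block, and point into a
#   buffer owned by the Python bytes object.
strcpy(tPtr.tokens[0], b1)
strcpy(tPtr.tokens[1], b2)
strcpy(tPtr.tokens[2], b3)
strcpy(tPtr.tokens[3], b4)
strcpy(tPtr.tokens[4], b5)

print(tPtr.tokens[0])
print(tPtr.tokens[1])
print(tPtr.tokens[2])
print(tPtr.tokens[3])
print(tPtr.tokens[4])

# - release the memory: each token first, then the array, then the struct
for j in range(5):
  free(tPtr.tokens[j])
free(tPtr.tokens)
free(tPtr)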



b'hello'
b'world'
b'there'
b'other'
b'nicer'


## Cython Advanced features
Cython has a number of useful advanced features, some taken from C, others defined only within cython.

- __ctypedef__: C allows for creating an alternative name for a data type. It's particularly useful when working with cumbersome function pointers.

- __function pointers__: Cython can make use of C's function pointer notation, that is, a pointer to a function.<br>
To read a function pointer, have a look at the following: __int (*compute) (int, int)__. Start with the first set of brackets, which begins with the __*__ symbol; this indicates that we are working with a function pointer. From here, look to the right: we have a function pointer that takes in 2 ints. Now look at the leftmost value: the function pointer returns an int. Thus, this function pointer takes in 2 ints and returns an int.


### *ex*: typedef keyword in C
```c
// assume we have two functions add, mult that take in ints a, b and return an int

// using typedef
typedef int(*opr)(int, int);

// function that takes in a function pointer and 2 ints and returns an int
int compute_plain(int (*operation)(int, int), int x, int y) {
    return operation(x, y);
}

// same function as above but now using typedef
// (C has no overloading, so the two versions need different names)
int compute(opr op, int x, int y) {
  return op(x, y);
}

int add(int a, int b) { return a + b;}
int mult(int a, int b) { return a * b;}


```
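
The same idea can be tried from plain Python with the standard `ctypes` module, where `CFUNCTYPE` plays the role of the `opr` typedef. This is only a sketch for illustration: the Python functions are wrapped into C-callable function pointers and then passed to `compute`, mirroring the C code above.

```python
import ctypes

# like: typedef int (*opr)(int, int);
Opr = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)

def add(a, b):
    return a + b

def mult(a, b):
    return a * b

def compute(op, x, y):
    """Call the function pointer op with the two ints x and y."""
    return op(x, y)

# wrap the Python functions as C function pointers of type Opr
add_ptr = Opr(add)
mult_ptr = Opr(mult)

print(compute(mult_ptr, 10, 20))  # → 200
print(compute(add_ptr, 100, 99))  # → 199
```

Because `Opr` fixes the signature, arguments and results are converted to and from C ints on every call.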


In [26]:
%%cython

# - fused types
# fused types allow generic programming within cython
# Currently, only variables and function/method arguments can be fused types
ctypedef fused int_float_double:
  int
  float
  double

# - ctypedef for the function pointer opr
ctypedef int (*opr)(int, int)

# - same as opr but using fused types
ctypedef int_float_double (*operation)(int_float_double v1, int_float_double v2)

# - compute takes in a function pointer and 2 ints
cdef int compute(opr op, int x, int y):
  return op(x, y)

cdef int add(int a, int b):
  return a + b
cdef int mult(int a, int b):
  return a * b

cdef int_float_double addf(int_float_double a, int_float_double b):
  return a + b
cdef int_float_double multf(int_float_double a, int_float_double b):
  return a * b


# - utilizing the function pointer and ctypedef
print(compute(mult, 10, 20))
print(compute(add, 100, 99))

# - utilizing fused type function pointer and ctypedef
print(addf(10.1, 10))
print(multf(100.1, .512))

200
199
20.1
51.2512
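
As a loose analogy only, fused types resemble a constrained `TypeVar` in Python's `typing` module: one function body is written once and is valid for several concrete types. The big difference is that Cython generates a separate C specialization per type at compile time, whereas the Python version below is just annotations and nothing is specialized. The name `Number` is my own.

```python
from typing import TypeVar

# constrained type variable: Number may be int or float, nothing else
Number = TypeVar("Number", int, float)

def addf(a: Number, b: Number) -> Number:
    return a + b

def multf(a: Number, b: Number) -> Number:
    return a * b

print(addf(2, 3))       # → 5
print(addf(1.5, 2.0))   # → 3.5
print(multf(4, 5))      # → 20
```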
