# File PreProcessing With C and Cython

__William Murphy__ - *Machine Learning Specialist*<br> *General Reinsurance Corporation*
<br>*email:* will.robert.murphy@gmail.com

---

*This notebook sets up an efficient way of opening files and extracting data from them. I make use of __Cython__ and __C's__ standard libraries to speed up file processing.*

### Tokenize Cython Class
---
*I build a cython class that uses C's __stdio__, __stdlib__, and __string__ libraries to do the heavy lifting of reading a file's contents.*<br>
*First, we must load in the cython extension.*

In [2]:
%load_ext cython

In [3]:
%%cython -f
# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
# cython: language_level=3
# cython: embedsignature=True
# cython: profile=True
# cython: boundscheck=False
# coding: utf8

from libc.stdlib cimport malloc, realloc, free
from libc.stdio cimport fopen, fclose, FILE, EOF, fseek, SEEK_END, SEEK_SET
from libc.stdio cimport ftell, fgetc, fgets, getc, gets, feof, fread, getline
from libc.string cimport strlen, memcpy, strcpy, strtok, strchr, strncpy
from cython.parallel import prange, parallel, threadid

# - C structure that is set to readonly
cdef readonly struct FileContents:
    char *contents
    
cdef class CyReadFile:
    """Read in the contents of a file."""
    cdef:
        FileContents *File
        FILE *fp
        char *filename
        char *delimiter
        long file_size
        bint is_open
        bint EO_STR
    
    def __init__(self, char *delimiter, char *filename):
        # allocate the struct itself (size of FileContents, not of the class)
        self.File = <FileContents*>malloc(sizeof(FileContents))
        self.delimiter = delimiter
        self.filename = filename
        self.File.contents = NULL
        self.is_open = 0
        self.EO_STR = 0
        self.file_size = 0
        self.fp = NULL
        
    def open_file(self):
        """Open the file for reading."""
        self.fp = fopen(self.filename, "r")
        if self.fp == NULL:
            raise FileNotFoundError(2, "No such file or directory: '%s'" % self.filename)
        else:
            # file is now open
            self.is_open = 1
    
    def read_in_file(self):
        """Read in the entire file."""
        if self.is_open == 1:
            # get the length of the file
            fseek(self.fp, 0, SEEK_END)
            self.file_size = ftell(self.fp)
            fseek(self.fp, 0, SEEK_SET)
            # allocate memory for reading in the file
            self.File.contents = <char*>malloc(self.file_size*sizeof(char))
            # read entire file into the struct
            fread(self.File.contents, 1, self.file_size, self.fp)
            # close the file once it's read into the char array
            fclose(self.fp)
            # set is_open to 0
            self.is_open = 0
              
    def read_file_in_parallel(self):
        """Bypass the gil and read in the file."""
        if self.is_open == 1:
            with nogil:
                # get the length of the file
                fseek(self.fp, 0, SEEK_END)
                self.file_size = ftell(self.fp)
                fseek(self.fp, 0, SEEK_SET)
                # allocate memory for reading in the file
                self.File.contents = <char*>malloc(self.file_size*sizeof(char))
                # read entire file into the struct
                fread(self.File.contents, 1, self.file_size, self.fp)
                # close the file once it's read into the char array
                fclose(self.fp)
                # set is_open to 0
                self.is_open = 0
    
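    def __dealloc__(self):
        """Release the C resources owned by this object."""
        if self.File != NULL:
            free(self.File.contents)
            free(self.File)
        # fp is a FILE*, so it is closed with fclose() rather than free()
        if self.is_open == 1 and self.fp != NULL:
            fclose(self.fp)
        # filename and delimiter point into Python bytes objects owned by
        # Python, so they must not be passed to free()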
        
            
# - The cython class can be instantiated from Python directly; a Python subclass
#   is simply a convenient place to fix the constructor arguments.
# - I will set the cython variables concretely in the Python subclass

# test data
emlFile = b"Y:\\Shared\\USD\\Business Data and Analytics\\Claims_Pipeline_Files\\Mapping_Files\\EmlMappingFile.csv"

class PyReadFile(CyReadFile):
    """A python wrapper around a cython class."""
    def __init__(self):
        super().__init__(b',', emlFile)
    
        
def py_read_file(filename):
    with open(filename, "r") as f:
        return f.read()
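
The size-then-read pattern that `read_in_file` uses (seek to the end, `ftell`, rewind, then a single `fread`) can be mirrored in plain Python for comparison. This is only a sketch: `read_whole_file` and the temporary sample file are hypothetical names introduced here, not part of the class above.

```python
import os
import tempfile


def read_whole_file(path):
    """Read a file by measuring its size first, then reading that many bytes."""
    with open(path, "rb") as fp:       # binary mode: raw bytes, no decoding
        fp.seek(0, os.SEEK_END)        # jump to the end of the file
        size = fp.tell()               # offset at the end equals size in bytes
        fp.seek(0, os.SEEK_SET)        # rewind to the start
        data = fp.read(size)           # one read call for the whole file
    return size, data


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,alpha\n2,beta\n")
    sample_path = tmp.name

size, data = read_whole_file(sample_path)
os.remove(sample_path)
print(size, data.split(b"\n")[0])
```

Opening in binary mode matters here: like `fread`, it returns the bytes exactly as stored, so the size reported by `tell()` matches the length of the data read.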
    



###  Performance Test

In [12]:
pyrf = PyReadFile()
pyrf.open_file()
%timeit pyrf.read_in_file()
%timeit py_read_file(emlFile)

81.3 ns ± 0.869 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
3.53 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


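As you can see, the __cython__ call is timed at the __*nanosecond*__ level and the pure __python__ *open* method at the __*millisecond*__ level. Let's convert both timings to the same unit to see how much faster the __cython__ call was.<br>

$$ 1 \quad millisecond = 1000000 \quad nanoseconds $$

$$ 3.53 \quad milliseconds = 3530000 \quad nanoseconds$$

$$ 81.3 \quad nanoseconds  = 0.0000813 \quad milliseconds$$

$$ \frac{3530000 \quad nanoseconds}{81.3 \quad nanoseconds} \approx 43400 $$
<br>
<br>
This means the measured __*run time*__ dropped by a factor of roughly __43,400__, which is a reduction of about __99.998%__. One caveat: `read_in_file` closes the file and sets `is_open` to 0 on its first call, so every later `%timeit` iteration skips the read. The 81.3 ns figure therefore mostly reflects the cost of a guarded method call rather than a full file read.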
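
For a like-for-like timing, each timed call should open, read, and close the file. A minimal sketch of such a harness in plain Python, using the standard `timeit` module, is shown here. The helper names (`read_text`, `read_bytes`, `best_seconds_per_call`) and the generated test file are assumptions made for illustration. It also separates text mode (which decodes to `str`, as `py_read_file` does) from binary mode (raw bytes, as `fread` returns).

```python
import os
import tempfile
import timeit


def read_text(path):
    """Open in text mode, which decodes the bytes into a str."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def read_bytes(path):
    """Open in binary mode, which returns the raw bytes."""
    with open(path, "rb") as f:
        return f.read()


def best_seconds_per_call(func, path, number=200, repeat=5):
    """Fastest per-call time; every call opens, reads, and closes the file."""
    runs = timeit.repeat(lambda: func(path), number=number, repeat=repeat)
    return min(runs) / number


# Hypothetical test file, generated only for this benchmark.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("id,name\n" * 10_000)
    bench_path = tmp.name

text_time = best_seconds_per_call(read_text, bench_path)
bytes_time = best_seconds_per_call(read_bytes, bench_path)
os.remove(bench_path)

print(f"text mode:   {text_time * 1e6:.1f} µs per call")
print(f"binary mode: {bytes_time * 1e6:.1f} µs per call")
```

The same idea applies to the cython class: creating a fresh object and calling `open_file()` inside the timed statement keeps every iteration doing real work.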

### Performance Test Using Concurrency
---
One of the great advantages of using __Cython__ is that we can release the *GIL*, or __*Global Interpreter Lock*__, around code that does not touch Python objects. To do so, we use the cython keyword: <br> 
>```cython
>    nogil
>```

We use the *__nogil__* keyword like so:
```cython
# cython: boundscheck = False
from cython.parallel cimport prange

# NOTE: prange needs the gil released and OpenMP compile/link flags;
# it is imported here for reference, but the loops below are serial.
# - this function releases the gil around its loop
cdef int f1(int n):
    cdef int i = 0
    cdef int sum_ = 0
    with nogil:
        for i in range(n):
            sum_ += i
    return sum_

# - this function is safe to use w/o the gil
# NOTE: It does not bypass the gil.
cdef int f2(int n) nogil:
    cdef int i = 0
    cdef int sum_ = 0
    for i in range(n):
        sum_ += i
    return sum_

```

### Cython Concurrency Calculations


In [4]:
# setup for comparing parallel cython and python open
pyrf_parallel = PyReadFile()
pyrf_parallel.open_file()
%timeit pyrf_parallel.read_file_in_parallel()
%timeit py_read_file(emlFile)

87.4 ns ± 7.48 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
3.42 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


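As expected, simply using __`nogil`__ without considering what else __cython__ is doing under the covers does not provide a speed boost. Releasing the GIL lets *other* threads run while the read is in progress, but the read itself still happens on a single thread, so its own run time does not change.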
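
Releasing the GIL pays off only when other threads have work to do in the meantime. As a rough illustration in plain Python, where CPython already releases the GIL while a thread waits on file I/O, several files can be read concurrently with a thread pool. The helper `read_many` and the generated sample files are hypothetical, and whether this is actually faster depends on the disk and the file sizes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    """Read one file's raw bytes."""
    with open(path, "rb") as f:
        return f.read()


def read_many(paths, max_workers=4):
    """Read several files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_bytes, paths))


# Hypothetical sample files, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(4):
    path = os.path.join(tmp_dir, f"part_{i}.csv")
    with open(path, "wb") as f:
        f.write(f"row,{i}\n".encode("utf-8") * 1000)
    paths.append(path)

contents = read_many(paths)
sizes = [len(c) for c in contents]

# Clean up the temporary files and directory.
for path in paths:
    os.remove(path)
os.rmdir(tmp_dir)

print(sizes)
```

The design point carries over to the cython class: a `with nogil:` block helps when several objects are reading on separate threads, not when one object reads one file.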

### Conclusion

I hope you enjoyed this lesson on using __cython__ to read in files. I'm currently in the process of creating a file Tokenizer class that will eventually be part of a larger __*NLP*__ cython package.<br>
Thanks!<br><br>
*Will*