# My First Time with Python Multi-processing
I want to share what I learned about Python multi-processing while handling medium-sized data on a single machine.

By medium-sized, I mean data at the GB level. In my case, that is 13.2GB of data, which is small enough to load into memory, but once it is loaded, you may not have enough memory left to run any algorithms.

Let's dive into it.

# Problem Description
I have 13.2GB of data consisting of 7481 binary files. Each file is about 1.81MB.

My task is simple:

1. Read the file into a Python data structure
2. Process the data into an image
3. Write the image to disk

## File Format
The file stores a collection of vectors. Each vector has 4 elements: x, y, z, r

Each element is a float32, so the size of a vector is 16 bytes.

There are no delimiters between vectors. Therefore, if the file size is 32 bytes, it contains exactly 2 vectors. If the size of a file is not divisible by 16 bytes, it is not a valid file.
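
To make the layout concrete, here is a minimal sketch that writes two made-up points to a temporary file and checks the size rule described above. The little-endian byte order (`"<4f"`) is an assumption on my part, since the format description does not state the byte order, and the sample values are invented for illustration.

```python
import os
import struct
import tempfile

VECTOR_SIZE = 16  # 4 float32 values: x, y, z, r

# Two made-up points, packed back to back with no delimiter.
# "<4f" assumes little-endian float32, which the text does not confirm.
points = [(1.0, 2.0, 3.0, 0.5), (4.0, 5.0, 6.0, 0.25)]
payload = b"".join(struct.pack("<4f", *p) for p in points)

with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(payload)
    sample_path = f.name

# The file size alone tells us how many vectors the file holds.
size = os.stat(sample_path).st_size
vector_count, leftover = divmod(size, VECTOR_SIZE)

# Decode the vectors again, 16 bytes at a time.
with open(sample_path, "rb") as f:
    decoded = list(struct.iter_unpack("<4f", f.read()))
os.remove(sample_path)

print(size, vector_count, leftover)
```

A non-zero `leftover` would mean the file is not a valid scan under the rule above.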
   


In [2]:
def read_velodyne_data(file_path):
    """
    Read velodyne binary data and return a numpy array
    """

    # First, check the size of this file to see if it's a valid velodyne binary file
    size = os.stat(file_path).st_size
    if size % 16 != 0:
        raise Exception('The size of '+file_path+' is not divisible by 16 bytes')

    with open(file_path, 'rb') as f:
        # Allocate memory for numpy array
        velodyne_data = np.empty(shape=(size//16, 4), dtype=np.float32)

        # Read the data, 16 bytes each time
        i = 0
        reader = BufferedReader(f)
        while reader.peek(16):
            read_bytes = reader.read(16)
            velodyne_data[i] = np.frombuffer(read_bytes, dtype=np.float32)
            i += 1

        # Check whether the expected number of vectors was read
        if i != size // 16:
            error = ' '.join(['The file size is', str(size), 'bytes, but', str(i), 'vectors were read'])
            raise Exception(error)

        return velodyne_data

## Let's process the data
Each file is a Velodyne point cloud scan.

This is a sample processed image. It is a bird's-eye view of the lidar data.
<img src="birdview.jpg" style="width: 600px; height: 600px"/>

The image is 1600 x 1600 pixels. Each pixel represents 10cm x 10cm of space. The range of the lidar in the x and y directions is 80m on each side of the sensor, so each axis spans 160m, which is 1600 pixels.

Here x is left and right, and y is front and back.

The image has 3 channels: height, intensity, and density.
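
The geometry above can be sketched as a small mapping from a point's x and y (in metres) to a pixel position. This is only an illustration: the names `to_pixel`, `IMAGE_SIZE`, `CELL`, and `CENTER` are hypothetical, the sensor is assumed to sit at the centre pixel, and the bounds check is my own addition.

```python
import math

IMAGE_SIZE = 1600         # pixels per side
CELL = 0.1                # metres per pixel (10cm)
CENTER = IMAGE_SIZE // 2  # assumed sensor position: pixel (800, 800)


def to_pixel(x, y):
    """Return (row, col) for a point, or None if it falls outside the image."""
    col = CENTER - math.ceil(x / CELL)
    row = CENTER - math.ceil(y / CELL)
    # Points beyond the covered range would index outside the array
    # (or wrap around with a negative index), so reject them here.
    if 0 <= row < IMAGE_SIZE and 0 <= col < IMAGE_SIZE:
        return row, col
    return None


print(to_pixel(0.0, 0.0))
print(to_pixel(100.0, 0.0))
```

A point at the sensor lands in the middle of the image, and a point 100m away is rejected because it lies beyond the 80m range.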

In [3]:
def bird_view_map(velodyne_data):
    """
    Implements the method in https://arxiv.org/pdf/1611.07759.pdf
    :param velodyne_data: a list of velodyne cloud points
    :return: 2D image with 3 channels: height, intensity and density
    """
    bird_view = np.zeros(shape=(1600, 1600, 3), dtype=np.float32)
    for point in velodyne_data:
        x = point[0]
        y = point[1]
        z = point[2]
        r = point[3]
        # if (-40 <= x <= 40) and (-40 <= y <= 40):
        xi = 800 - int(np.ceil(x/0.1))
        yi = 800 - int(np.ceil(y/0.1))
        if -z > bird_view[yi][xi][0]:
            bird_view[yi][xi][0] = -z
            bird_view[yi][xi][1] = r
            bird_view[yi][xi][2] += 1

    # todo: normalize birdview with the real method
    bird_view[:,:,0] = np.interp(bird_view[:,:,0], xp=(np.min(bird_view[:,:,0]), np.max(bird_view[:,:,0])), fp=(0, 255))
    bird_view[:,:,1] = np.interp(bird_view[:,:,1], xp=(0, 1), fp=(0, 255))
    bird_view[:,:,2] = np.interp(bird_view[:,:,2], xp=(np.min(bird_view[:,:,2]), np.max(bird_view[:,:,2])), fp=(0, 255))
    return bird_view

In [4]:
from glob import glob
import cv2
import os
import numpy as np
from io import BufferedReader
import time


paths = glob('data/raw/*.bin')
t = time.time()
for path in paths:
    data = read_velodyne_data(path)
    view = bird_view_map(data)
    cv2.imwrite(''.join(['data/processed/', os.path.basename(path)[:-4], '.png']), view)
used = time.time() - t
print('Single processed version used', used, 'seconds to process', len(paths), 'files')

Single processed version used 143.17335605621338 seconds


### How can we optimize this program?
Notice that we have 2 functions here:
1. __read_velodyne_data__, which does the input.
2. __bird_view_map__, which does the computation.

Before we get into any parallelism, the first step is always to optimize the individual functions, because if your functions are slow, your parallel code is just slow code on more cores.

We are not going to rely on intuition or experience, staring at the code and hoping something turns up.

We are going to use profiling tools to measure precisely where the time goes.

### Profile the code
The traditional cProfile won't be useful here because it only profiles code at the function level: it tells you how much time each function call uses. Say you learn that function __f1__ used 80% of your program's time. You still don't know which line in __f1__ costs so much.

That's why we need a line-by-line profiler: [line_profiler](https://github.com/rkern/line_profiler) by [Robert Kern](https://github.com/rkern).

Please look at my __README.md__ and line_profiler's README for profiling instructions.
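
As a rough idea of what a line_profiler run looks like, here is a minimal sketch. The function `count_records` is a hypothetical stand-in for the reader loop, not the code from this project, and the script name in the comment is made up. The fallback decorator lets the same file run with or without the profiler.

```python
import io

# When a script is launched with `kernprof -l`, a `profile` decorator is
# injected into builtins. This fallback keeps the script runnable without it.
try:
    profile
except NameError:
    def profile(func):
        return func


@profile
def count_records(raw, record_size=16):
    """Count fixed-size records with a peek-then-read loop."""
    reader = io.BufferedReader(io.BytesIO(raw))
    count = 0
    while reader.peek(record_size):
        reader.read(record_size)
        count += 1
    return count


if __name__ == "__main__":
    # Profile with (hypothetical file name):
    #   kernprof -l -v profile_demo.py
    print(count_records(bytes(16 * 1000)))
```

The `-l` flag turns on line-by-line profiling and `-v` prints the per-line timings when the script finishes.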

### Now we know
Now we know that __while reader.peek(16)__ is the bottleneck of the __read_velodyne_data__ function.

__peek__ is unnecessary because we are going to read the data anyway. Therefore, we can change this function to:

In [5]:
def read_velodyne_data_quick(file_path):
    """
    Read velodyne binary data and return a numpy array
    """

    # First, check the size of this file to see if it's a valid velodyne binary file
    size = os.stat(file_path).st_size
    if size % 16 != 0:
        raise Exception('The size of '+file_path+' is not divisible by 16 bytes')

    with open(file_path, 'rb') as f:
        # Allocate memory for numpy array
        velodyne_data = np.empty(shape=(size//16, 4), dtype=np.float32)

        # Read the data, 16 bytes each time
        i = 0
        reader = BufferedReader(f)
        read_bytes = reader.read(16)  # As you can see here, we read directly and check if read_bytes has values.
        while read_bytes:
            velodyne_data[i] = np.frombuffer(read_bytes, dtype=np.float32)
            read_bytes = reader.read(16)
            i += 1

        # Check whether the expected number of vectors was read
        if i != size // 16:
            error = ' '.join(['The file size is', str(size), 'bytes, but', str(i), 'vectors were read'])
            raise Exception(error)

        return velodyne_data
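
As an aside, NumPy can also read a whole file of fixed-size records in one call, which avoids the Python-level loop entirely. The sketch below is a different technique from the one used in this notebook, not a drop-in measurement of it: `read_velodyne_data_fromfile` is a hypothetical name, and the sample file is made up for illustration.

```python
import os
import tempfile

import numpy as np


def read_velodyne_data_fromfile(file_path):
    """Read all (x, y, z, r) float32 vectors with a single np.fromfile call."""
    size = os.stat(file_path).st_size
    if size % 16 != 0:
        raise ValueError(f"The size of {file_path} is not divisible by 16 bytes")
    # One flat float32 array, reshaped to one row per vector.
    return np.fromfile(file_path, dtype=np.float32).reshape(-1, 4)


# Made-up sample: two vectors holding the values 0..7.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(np.arange(8, dtype=np.float32).tobytes())
    sample_path = f.name

sample = read_velodyne_data_fromfile(sample_path)
os.remove(sample_path)
print(sample.shape)
```

I have not timed this variant on the data set here, so treat any speed-up as something to confirm with the profiler.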

### Now let's run it for 100 files

In [10]:
t = time.time()
paths = glob('data/raw/*.bin')[:100]
for path in paths:
    data = read_velodyne_data_quick(path)
    view = bird_view_map(data)
    cv2.imwrite(''.join(['data/processed/', os.path.basename(path)[:-4], '.png']), view)
used2 = time.time() - t
print('Better version used', used2, 'seconds')
print('It is ', used - used2, 'seconds faster for', len(paths), 'files')

Better version used 131.4069118499756 seconds
It is  11.766444206237793 seconds faster for 100 files


### 4% Performance Gain
As you can see, conservatively speaking, it is about 5 seconds faster for 100 files. This is about 4% faster.

I don't know about you, but I am impressed. Just by changing a single line of IO, we save 4% of the time.

For our data, we have 7481 files, so we will save about 374.05 seconds in total.

Just imagine if you had much more data: 4% is an incredible optimization.
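
For reference, this is how the percentages fall out of the two timings printed above, taken at face value. The 5-second figure is the conservative estimate used in the text, not a separate measurement.

```python
baseline = 143.17335605621338  # seconds, first version
improved = 131.4069118499756   # seconds, version without peek

# Saving as measured by the two runs.
measured_saving = baseline - improved
measured_pct = measured_saving / baseline * 100

# Conservative estimate used in the text: 5 seconds per 100 files.
conservative_saving = 5.0
conservative_pct = conservative_saving / improved * 100

print(round(measured_saving, 2), round(measured_pct, 1))
print(round(conservative_pct, 1))
```

The measured gap is larger than the conservative figure, which leaves room for run-to-run noise.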

### Multi-process
Because notebooks don't work well with multi-process code, we will discover the multi-process world in an ordinary Python script.

Please see __multi.py__
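
To give a feel for the shape such a script can take, here is a minimal sketch using `multiprocessing.Pool`. It is not the content of __multi.py__: the names `output_path_for`, `process_one`, and `process_all` are hypothetical, and `process_one` is only a stand-in for the read, process, and write steps shown earlier.

```python
import os
from glob import glob
from multiprocessing import Pool


def output_path_for(input_path, out_dir="data/processed"):
    """Map data/raw/000001.bin to data/processed/000001.png."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(out_dir, stem + ".png")


def process_one(input_path):
    """Stand-in worker: handle exactly one file, independent of all others."""
    # The real work (read the scan, build the bird's-eye view, write the
    # image) would go here. This sketch only computes the output path.
    return output_path_for(input_path)


def process_all(paths, processes=None):
    """Spread the files over worker processes (one per CPU by default)."""
    if not paths:
        return []
    with Pool(processes=processes) as pool:
        # chunksize hands each worker several files at a time to cut overhead.
        return pool.map(process_one, paths, chunksize=16)


if __name__ == "__main__":
    done = process_all(sorted(glob("data/raw/*.bin")))
    print("processed", len(done), "files")
```

The `if __name__ == "__main__":` guard matters because worker processes may re-import the script, and each file is a natural unit of work since no file depends on another.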