
# Vectorization

Vectorization is used to speed up your Python code. It is essential when working with large data sets.

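Vectorized code contains no loops written in Python; instead, we operate on whole compound data structures such as *numpy* arrays or *pandas* data series. The core of these modules (*numpy*, *pandas*) is largely implemented in compiled languages such as C/C++, so the loops that run inside them are much more efficient than loops executed by the Python interpreter.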
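
As a minimal sketch of the idea on a tiny, made-up list (the names `values`, `doubled_loop` and `doubled_vec` are only illustrative), the same operation can be written with an explicit Python loop or as a single array expression:

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Loop version: the Python interpreter handles the elements one by one.
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2.0)

# Vectorized version: one expression, the loop runs inside numpy.
arr = np.array(values)
doubled_vec = arr * 2.0

print(doubled_loop)           # [2.0, 4.0, 6.0, 8.0]
print(doubled_vec.tolist())   # [2.0, 4.0, 6.0, 8.0]
```

Both versions give the same numbers; only the place where the loop runs differs.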

The vectorized solution is not only faster; the code is also shorter, which makes it easier to read, maintain and debug.

Let's see some examples comparing non-vectorized and vectorized solutions.

We use large vectors/matrices with ten million elements to make the difference between non-vectorized and vectorized solution more visible.



## Vector and scalar product

We have a vector of ten million float numbers and we would like to scale the elements of the vector, that is, multiply each of them by the same scalar.

In [2]:
import numpy as np
import random
import time

# initializing data used later
n = 10_000_000              # size of vector
scalar = 2.564              # scale factor for the vector
vlist = [random.random() for i in range(n)]  # generating random list (non-vectorized)
vect = np.random.rand(n)     # generating random vector (vectorized)

In [None]:
start_time = time.time()    # get current time
slist = []
for i in range(n):
    slist.append(vlist[i] * scalar)
print(f'Non-vectorized solution for {n} items in {(time.time() - start_time):.2f} seconds')

Non-vectorized solution for 10000000 items in 3.65 seconds


In [None]:
start_time = time.time()    # get current time
s1list = [v * scalar for v in vlist]
print(f'List comprehension solution for {n} items in {(time.time() - start_time):.2f} seconds')

List comprehension solution for 10000000 items in 1.27 seconds


In [None]:
start_time = time.time()    # get current time
svect = vect * scalar
print(f'Vectorized solution for {n} items in {(time.time() - start_time):.2f} seconds')

Vectorized solution for 10000000 items in 0.04 seconds
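
A single `time.time()` measurement can be noisy, especially for the fast vectorized case. One possible alternative, sketched here under the assumption that a smaller vector is acceptable for a quick check, is the standard `timeit` module, which repeats the statement several times. The names `n_small`, `factor`, `data_list` and `data_arr` are hypothetical and not part of the cells above.

```python
import timeit
import numpy as np

n_small = 100_000                  # hypothetical smaller size for a quick run
factor = 2.564
data_list = [float(i) for i in range(n_small)]
data_arr = np.arange(n_small, dtype=float)

# Run each statement 5 times per measurement, repeat the measurement 3 times
# and keep the best (smallest) total time.
t_list = min(timeit.repeat(lambda: [v * factor for v in data_list],
                           number=5, repeat=3))
t_arr = min(timeit.repeat(lambda: data_arr * factor,
                          number=5, repeat=3))

print(f'List comprehension: {t_list:.4f} seconds for 5 runs')
print(f'Vectorized:         {t_arr:.4f} seconds for 5 runs')
```

The exact numbers depend on the machine, so none are shown here.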


## Find the largest value in a vector

In [None]:
start_time = time.time()    # get current time
vmax = vlist[0]
for v in vlist[1:]:
    if v > vmax: vmax = v
print(f'Max item non-vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item non-vectorized 0.9999999144011416 in 1.71 seconds


In [None]:
start_time = time.time()    # get current time
vmax = max(vlist)
print(f'Max item list-vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item list-vectorized 0.9999999144011416 in 0.32 seconds


In [None]:
start_time = time.time()    # get current time
vmax = np.max(vect)
print(f'Max item vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item vectorized 0.9999999924191931 in 0.02 seconds


## Find the largest absolute difference between the neighboring vector items

In [None]:
start_time = time.time()    # get current time
max_dif = abs(vlist[0] - vlist[1])
for i in range(1, n):
    dif = abs(vlist[i-1] - vlist[i])
    if dif > max_dif: max_dif = dif
print(f'Max abs difference non-vectorized {max_dif} in {(time.time() - start_time):.2f} seconds')

Max abs difference non-vectorized 0.9999343275314208 in 6.18 seconds


In [None]:
start_time = time.time()    # get current time
max_dif = np.max(np.abs(vect[:-1] - vect[1:]))
print(f'Max abs difference vectorized {max_dif} in {(time.time() - start_time):.2f} seconds')

Max abs difference vectorized 0.999354060404062 in 0.09 seconds
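
The two results above differ because `vlist` and `vect` hold different random numbers. As a small sketch on shared data (the seed and the names `sample` and `sample_list` are assumptions made only for this illustration), the loop and the vectorized expression can be checked against each other, together with `np.diff`, which computes the neighboring differences directly:

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed, chosen only for repeatability
sample = rng.random(1000)
sample_list = sample.tolist()      # same values as a plain Python list

# loop version on the list
max_dif_loop = 0.0
for i in range(1, len(sample_list)):
    dif = abs(sample_list[i - 1] - sample_list[i])
    if dif > max_dif_loop:
        max_dif_loop = dif

# vectorized versions on the array
max_dif_slices = np.max(np.abs(sample[:-1] - sample[1:]))
max_dif_diff = np.max(np.abs(np.diff(sample)))

print(np.isclose(max_dif_loop, max_dif_slices))
print(np.isclose(max_dif_slices, max_dif_diff))
```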


## Calculate row-wise mean of a matrix

In [None]:
matrix = vect.reshape((5000, n // 5000))
list_matrix = list(matrix)
start_time = time.time()    # get current time
row_means = []
for row in list_matrix:
    row_means.append(sum(row) / len(row))
print(f'Row wise mean non-vectorized in {(time.time() - start_time):.2f} seconds')

Row wise mean non-vectorized in 1.69 seconds


In [None]:
start_time = time.time()    # get current time
row_means = np.mean(matrix, axis=1)
print(f'Row wise mean vectorized in {(time.time() - start_time):.2f} seconds')

Row wise mean vectorized in 0.01 seconds
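
The `axis` argument decides along which direction the mean is taken. A tiny sketch with a made-up 2 x 3 matrix (the name `m` is only illustrative) shows the difference between `axis=1` and `axis=0`:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

row_means_small = np.mean(m, axis=1)   # one value per row
col_means_small = np.mean(m, axis=0)   # one value per column

print(row_means_small)   # [2. 5.]
print(col_means_small)   # [2.5 3.5 4.5]
```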


## Calculate area from an array of coordinates

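The area of a polygon can be calculated from the coordinates of its vertices with the following formula, where the vertex after the last one is the first vertex again:

$ 2 \cdot Area = \sum_{i} (x_i - x_{i+1}) \cdot (y_i + y_{i+1})$

The sign of the sum depends on the orientation of the vertices, so the absolute value is taken in the solutions.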
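
As a quick check of the formula on a shape with a known area, here is a minimal sketch with a 4 x 3 rectangle. The helper name `polygon_area` is hypothetical, and `np.roll` is used here to get the "next" vertex instead of appending the first point to the end.

```python
import numpy as np

def polygon_area(xy):
    """Area of a simple polygon from an (n, 2) array of vertices."""
    x, y = xy[:, 0], xy[:, 1]
    x_next = np.roll(x, -1)    # x_{i+1}, the last vertex wraps to the first
    y_next = np.roll(y, -1)    # y_{i+1}
    return abs(np.sum((x - x_next) * (y + y_next))) / 2

rect = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
print(polygon_area(rect))      # 12.0
```

The four terms of the sum are 0, 0, 24 and 0, so twice the area is 24 and the area is 12.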


First we create a polygon with many vertices (one every 0.01 degree, about 36 000 points) on the perimeter of a circle with a radius of 10.

In [4]:
from math import sin, cos, pi
# generate points on the perimeter of a circle
c = []
r = 10
for i in np.arange(0, 360, 0.01):
    x = r * sin(i / 180 * pi)
    y = r * cos(i / 180 * pi)
    c.append((x, y))
coords = np.array(c)
coords1 = np.vstack((coords, coords[0]))    # add first point to the end
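
The point generation itself can also be vectorized. The following is a sketch of one possible way, not the approach used in the cell above; the names `angles`, `coords_vec` and `coords1_vec` are assumptions introduced for this illustration, and the values may differ from the loop version by tiny rounding errors.

```python
import numpy as np

r = 10
# all angles at once, converted from degrees to radians
angles = np.radians(np.arange(0, 360, 0.01))
# x in the first column, y in the second column
coords_vec = np.column_stack((r * np.sin(angles), r * np.cos(angles)))
# add first point to the end to close the polygon
coords1_vec = np.vstack((coords_vec, coords_vec[0]))
print(coords_vec.shape, coords1_vec.shape)
```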

In [10]:
%%time
# non-vectorized solution
s = 0
n = coords.shape[0]
for i in range(n):
    j = i+1 if i < n-1 else 0
    s += (coords[i, 0] - coords[j, 0]) * (coords[i, 1] + coords[j, 1])
area1 = abs(s / 2)
print(f'Area non-vectorized: {area1:.5f}')

Area non-vectorized: 314.15926
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 49.1 ms


In [11]:
%%time
# vectorized solution, one liner
area2 = abs(np.sum((coords1[:-1, 0] - coords1[1:, 0]) * (coords1[:-1, 1] + coords1[1:, 1]))) / 2
print(f'Area vectorized {area2:.5f}')

Area vectorized 314.15926
CPU times: user 3.85 ms, sys: 0 ns, total: 3.85 ms
Wall time: 3.4 ms


Note that the non-vectorized solution takes more than ten times longer (about 49 ms versus 3.4 ms wall time in the runs above).