<a href="https://colab.research.google.com/github/OSGeoLabBp/tutorials/blob/master/english/python/vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectorization

Vectorization is used to speed up your Python code. It is essential when working with large data sets.

Vectorized code contains no loops written in Python; instead, operations are applied to whole compound data structures such as *numpy* arrays or *pandas* data series. The core of these modules (*numpy*, *pandas*) is implemented in compiled code (C/C++), so the loops run there and are much more efficient.
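
As a minimal sketch of the idea, the snippet below squares the items of a short list first with a Python loop and then with a single *numpy* expression. The data and the variable names (`values`, `squares_loop`, `squares_vect`) are illustrative assumptions, not part of the examples that follow.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]   # illustrative data, plain Python list

# Non-vectorized: the loop runs in the Python interpreter, one item at a time
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# Vectorized: one expression on the whole array, the loop runs inside numpy
arr = np.array(values)
squares_vect = arr * arr

print(squares_loop)             # [1.0, 4.0, 9.0, 16.0]
print(squares_vect.tolist())    # [1.0, 4.0, 9.0, 16.0]
```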

The vectorized solution is not only faster, the code is also shorter, which makes it easier to maintain and debug.

Let's compare non-vectorized and vectorized solutions on a few examples.

We use large vectors/matrices with ten million elements to make the difference between the non-vectorized and vectorized solutions more visible.



## Vector and scalar product

We have a vector of ten million float numbers and we would like to multiply each element of the vector by a scalar.

In [2]:
import numpy as np
import random
import time

# initializing data used later
n = 10_000_000              # size of vector
scalar = 2.564              # scale factor for the vector
vlist = [random.random() for i in range(n)]  # random list (used by the non-vectorized solutions)
vect = np.random.rand(n)     # random vector (used by the vectorized solutions), values differ from vlist

In [3]:
start_time = time.time()    # get current time
slist = []
for i in range(n):
    slist.append(vlist[i] * scalar)
print(f'Non vectorized solution for {n} items in {(time.time() - start_time):.2f} seconds')

Non vectorized solution for 10000000 items in 1.97 seconds


In [4]:
start_time = time.time()    # get current time
s1list = [v * scalar for v in vlist]
print(f'List comprehension solution for {n} items in {(time.time() - start_time):.2f} seconds')

List comprehension solution for 10000000 items in 0.87 seconds


In [5]:
start_time = time.time()    # get current time
svect = vect * scalar
print(f'Vectorized solution for {n} items in {(time.time() - start_time):.2f} seconds')

Vectorized solution for 10000000 items in 0.07 seconds
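
A single `time.time()` measurement can be noisy, so the timings above are only indicative. The sketch below is one possible way to repeat the measurement with the standard `timeit` module and to check that both approaches give the same numbers. The smaller size `n_small` and the names `vect_small`, `vlist_small`, `t_list`, `t_vect` are assumptions made here to keep the run short.

```python
import timeit
import numpy as np

n_small = 100_000                    # assumption: smaller than n above to keep the run short
scalar = 2.564
vect_small = np.random.rand(n_small)
vlist_small = vect_small.tolist()    # the same values as a plain Python list

# Both approaches should produce the same scaled values
scaled_list = [v * scalar for v in vlist_small]
scaled_vect = vect_small * scalar
print(np.allclose(scaled_list, scaled_vect))   # True

# Best of 5 repeats, each repeat runs the statement 10 times
t_list = min(timeit.repeat(lambda: [v * scalar for v in vlist_small],
                           number=10, repeat=5))
t_vect = min(timeit.repeat(lambda: vect_small * scalar,
                           number=10, repeat=5))
print(f'list comprehension: {t_list:.4f} s, numpy: {t_vect:.4f} s for 10 runs')
```

Taking the minimum of several repeats reduces the influence of other processes running on the machine.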


## Find the largest value in a vector

In [6]:
start_time = time.time()    # get current time
vmax = vlist[0]
for v in vlist[1:]:
    if v > vmax: vmax = v
print(f'Max item non-vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item non-vectorized 0.9999999394889566 in 0.84 seconds


In [7]:
start_time = time.time()    # get current time
vmax = max(vlist)
print(f'Max item list-vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item list-vectorized 0.9999999394889566 in 0.14 seconds


In [8]:
start_time = time.time()    # get current time
vmax = np.max(vect)
print(f'Max item vectorized {vmax} in {(time.time() - start_time):.2f} seconds')

Max item vectorized 0.999999978113647 in 0.02 seconds
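
The maxima printed above differ between the list and the *numpy* runs because `vlist` and `vect` hold different random numbers. As a small sketch, assuming we build the list from the array so that both hold the same data, the three approaches return the same value. The seed and the names `vect_c`, `vlist_c` are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed, only for reproducibility
vect_c = rng.random(100_000)         # smaller than n above to keep the run short
vlist_c = vect_c.tolist()            # the same values as a plain Python list

# Explicit Python loop
vmax_loop = vlist_c[0]
for v in vlist_c[1:]:
    if v > vmax_loop:
        vmax_loop = v

vmax_builtin = max(vlist_c)          # built-in function, loop runs in C
vmax_numpy = np.max(vect_c)          # vectorized numpy solution

print(vmax_loop == vmax_builtin == vmax_numpy)   # True

# np.argmax returns the position of the (first) maximum
print(np.argmax(vect_c) == vlist_c.index(vmax_builtin))   # True
```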


## Find the largest absolute difference between the neighboring vector items

In [9]:
start_time = time.time()    # get current time
max_dif = abs(vlist[0] - vlist[1])
for i in range(1, n):
    dif = abs(vlist[i-1] - vlist[i])
    if dif > max_dif: max_dif = dif
print(f'Max abs difference non-vectorized {max_dif} in {(time.time() - start_time):.2f} seconds')

Max abs difference non-vectorized 0.9996972581419408 in 4.45 seconds


In [10]:
start_time = time.time()    # get current time
max_dif = np.max(np.abs(vect[:-1] - vect[1:]))
print(f'Max abs difference vectorized {max_dif} in {(time.time() - start_time):.2f} seconds')

Max abs difference vectorized 0.999786309341876 in 0.11 seconds
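
The slicing expression `vect[:-1] - vect[1:]` pairs every item with its right neighbor. A possible alternative, shown here as a sketch on a small illustrative vector (`vect_d` is an assumed name), is `np.diff`, which computes the differences of consecutive items directly.

```python
import numpy as np

vect_d = np.array([0.2, 0.9, 0.4, 0.5, 0.1])   # small illustrative vector

# Slicing form: v[i] - v[i+1] for every neighboring pair
max_dif_slices = np.max(np.abs(vect_d[:-1] - vect_d[1:]))

# np.diff computes v[i+1] - v[i]; np.abs removes the sign difference
max_dif_diff = np.max(np.abs(np.diff(vect_d)))

print(max_dif_slices, max_dif_diff)            # both values are about 0.7
```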


## Calculate row-wise mean of a matrix

In [11]:
matrix = vect.reshape((5000, n // 5000))
list_matrix = list(matrix)
start_time = time.time()    # get current time
row_means = []
for row in list_matrix:
    row_means.append(sum(row)/ len(row))
print(f'Row wise mean non-vectorized in {(time.time() - start_time):.2f} seconds')

Row wise mean non-vectorized in 1.00 seconds


In [12]:
start_time = time.time()    # get current time
row_means = np.mean(matrix, axis=1)
print(f'Row wise mean vectorized in {(time.time() - start_time):.2f} seconds')

Row wise mean vectorized in 0.01 seconds
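
The `axis` parameter decides along which direction the mean is taken. The sketch below uses a small 3 x 4 matrix (the names `small`, `small_row_means`, `small_col_means` are assumptions for this illustration) to show that `axis=1` gives one value per row and `axis=0` gives one value per column.

```python
import numpy as np

small = np.arange(12, dtype=float).reshape((3, 4))   # 3 rows, 4 columns

small_row_means = np.mean(small, axis=1)   # one mean per row
small_col_means = np.mean(small, axis=0)   # one mean per column
print(small_row_means)   # [1.5 5.5 9.5]
print(small_col_means)   # [4. 5. 6. 7.]

# The non-vectorized row loop gives the same row means
loop_means = [sum(row) / len(row) for row in small]
print(np.allclose(small_row_means, loop_means))   # True
```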
