# Optimisation: NumPy

Vanilla Python is poorly suited to storing and manipulating large amounts of numerical data. The built-in containers (lists, dictionaries, sets and tuples) hold references to individually allocated Python objects, which makes them slow and memory-hungry at scale. If you're handling large amounts of data in a performance-critical area of your code, substantial gains can be made by using other data types to store it.

One of the most popular ways of holding large amounts of data is the [NumPy](https://numpy.org/) package. NumPy provides a powerful array data type which can store large N-dimensional arrays of data. As NumPy is largely a layer over compiled code written in C, it bypasses many of the weaknesses of Python to provide very fast performance for common linear algebra (and other) operations.
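As a rough illustration of the storage difference, the sketch below compares a Python list with a NumPy array holding the same million integers. The variable names and the choice of `np.int64` are assumptions made for this example, and the list figure will vary with Python version and platform.

```python
import sys

import numpy as np

n = 1_000_000

# A Python list stores pointers to separately allocated int objects.
py_list = list(range(n))
# A NumPy array stores the raw values contiguously in one typed buffer.
np_array = np.arange(n, dtype=np.int64)

# Size of the list's pointer table only; the int objects themselves are extra.
list_bytes = sys.getsizeof(py_list)
# Size of the array's data buffer: n elements at 8 bytes each.
array_bytes = np_array.nbytes

print(f"list pointer table: {list_bytes:,} bytes (plus the int objects)")
print(f"array data buffer:  {array_bytes:,} bytes")

# Element-wise arithmetic runs in compiled code, with no Python-level loop.
doubled = np_array * 2
```

Because the array holds plain 8-byte values rather than references to objects, operations such as `np_array * 2` can be carried out in a single pass over the buffer in compiled code.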

This notebook doesn't aim to give you a working knowledge of NumPy. Instead, it aims to offer a brief demonstration of the savings which can be made by using NumPy.

## Dot Product Example

In the example below we generate two random vectors, each with 1,000,000 entries, and then calculate their dot product. The first cell uses plain Python lists; the second uses NumPy arrays.

In [1]:
 
import random

%load_ext line_profiler

def random_dot_product(n):
  list_1 = [random.randrange(1, 100, 1) for i in range(n)]
  list_2 = [random.randrange(1, 100, 1) for i in range(n)]

  return(sum([x*y for x, y in zip(list_1, list_2)]))

%lprun -f random_dot_product random_dot_product(1000000)

Timer unit: 1e-07 s

Total time: 9.26864 s
File: C:\Users\jacob\AppData\Local\Temp\ipykernel_5900\622497151.py
Function: random_dot_product at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
     5                                           def random_dot_product(n):
     6         1   44903909.0 44903909.0     48.4    list_1 = [random.randrange(1, 100, 1) for i in range(n)]
     7         1   45015962.0 45015962.0     48.6    list_2 = [random.randrange(1, 100, 1) for i in range(n)]
     8                                           
     9         1    2766559.0 2766559.0      3.0    return(sum([x*y for x, y in zip(list_1, list_2)]))

In [2]:
 
import numpy as np

%load_ext line_profiler

def random_dot_product(n):
  array1 = np.random.rand(n)
  array2 = np.random.rand(n)

  return(np.dot(array1, array2))

%lprun -f random_dot_product random_dot_product(1000000)

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


Timer unit: 1e-07 s

Total time: 0.0217881 s
File: C:\Users\jacob\AppData\Local\Temp\ipykernel_5900\159671606.py
Function: random_dot_product at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
     5                                           def random_dot_product(n):
     6         1      93069.0  93069.0     42.7    array1 = np.random.rand(n)
     7         1      65484.0  65484.0     30.1    array2 = np.random.rand(n)
     8                                           
     9         1      59328.0  59328.0     27.2    return(np.dot(array1, array2))

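The second case, using NumPy, executes around 425 times faster than the first (about 9.27 s against 0.022 s). This is because the bulk of the calculation has been shifted out of Python and into C, which is much faster. In addition, NumPy provides functions designed specifically to generate a random array and to compute a dot product, each with an optimised implementation, whereas in the first version we wrote the calculation in raw Python.

Note that the two cells are not strictly equivalent: the first draws random integers between 1 and 99, while the second draws random floats in [0, 1). The profiles also show where the gain comes from. In the pure-Python version about 97% of the time is spent generating the random numbers, and the dot product line on its own is roughly 47 times faster with NumPy.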
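Since the two cells above generate different kinds of data, a like-for-like timing of the dot product alone can be useful. The following is a minimal sketch, assuming `timeit` from the standard library in place of `line_profiler` and NumPy's `default_rng` generator; the function names and the seed are hypothetical choices for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability

# Same integer data in both representations, so only the dot product differs.
array_1 = rng.integers(1, 100, size=n)
array_2 = rng.integers(1, 100, size=n)
list_1 = array_1.tolist()
list_2 = array_2.tolist()


def dot_python(a, b):
    """Dot product using a Python-level loop over two lists."""
    return sum(x * y for x, y in zip(a, b))


def dot_numpy(a, b):
    """Dot product delegated to NumPy's compiled implementation."""
    return np.dot(a, b)


# Both versions should agree exactly, since the inputs are identical integers.
assert dot_python(list_1, list_2) == dot_numpy(array_1, array_2)

# Take the best of three runs of each version to reduce timing noise.
t_python = min(timeit.repeat(lambda: dot_python(list_1, list_2), number=1, repeat=3))
t_numpy = min(timeit.repeat(lambda: dot_numpy(array_1, array_2), number=1, repeat=3))

print(f"pure Python: {t_python:.4f} s")
print(f"NumPy:       {t_numpy:.4f} s")
```

Building the lists from the arrays with `tolist()` keeps the inputs identical, so any difference in the timings comes from the dot product itself and not from how the data was generated.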