In [1]:
import numpy as np
import matplotlib.pyplot as plt
import time
from numba import jit, njit, vectorize, guvectorize

# Numba

**Numba** is a just-in-time (JIT) compiler that translates Python functions into optimized machine code at runtime. 

When a Python function is decorated with Numba, it gets compiled into highly efficient machine code the first time it's called, resulting in performance improvements ranging from 2x to 100x compared to pure Python code. This is particularly powerful for numerical computations and algorithms that would otherwise be slow in standard Python.

Key benefits of Numba:
- Near-C execution speeds while writing pure Python code
- Automatic parallelization capabilities
- Minimal code changes required - often just a decorator

In this notebook, we'll explore some basic examples of using Numba to optimize Python functions.

## @jit Decorator

### The Basics

The `@numba.jit` decorator enables just-in-time (JIT) compilation of Python functions, potentially yielding significant performance improvements over pure Python execution. 

Numba achieves its best performance improvements when working with:
- **NumPy arrays**: Numba excels at optimizing operations on NumPy arrays
- **Numerical computations**: Loops and mathematical operations get heavily optimized
- **Type-stable code**: Functions where variable types don't change during execution

When a decorated function is called, Numba performs _type inference_ - automatically determining the types of all variables in the function. Based on these inferred types, it generates optimized machine code specifically for those types.

Let's start with a simple example to demonstrate Numba's capabilities: computing the sum of all elements in a NumPy array. This example will show the basic usage of the `@jit` decorator and its performance impact.

In [None]:
# Compute the sum of all elements in an array using Numba
@jit
def simple_sum(data):
    s = 0
    for d in data:  # Explicit loop will be optimized by Numba
        s += d
    return s

Adding the decorator to the function instructs Numba to compile it to machine code the first time the function is called.

Once compiled, Numba maintains:
- The compiled "jitted" version for optimized execution
- The original Python implementation, accessible via the `simple_sum.py_func` attribute

This allows you to:
1. Compare performance between compiled and interpreted versions
2. Access the original Python code for debugging
3. Verify the correctness of the compilation by comparing results

In [None]:
# Define an arbitrary vector to test `simple_sum` function
input_data = 10*np.random.random(100_000)
input_data[:10]

`for` loops in Python are notoriously slow due to several factors:
- Python's dynamic typing requires type checking at each iteration
- The interpreter overhead on each loop iteration
- The flexibility of Python objects adds runtime costs

The conventional solution is to use NumPy's vectorized operations instead of explicit loops:
- NumPy functions are implemented in `C`
- They operate on entire arrays at once
- Array operations bypass Python's interpreter overhead

For our sum example, NumPy provides the built-in `np.sum()` function, which efficiently computes array sums without explicit loops.
We use this example only as a starting point for understanding how and where Numba can speed up purely Pythonic code.

Let's now compare the execution time across the three alternatives:
- The purely pythonic function implementation
- The same pythonic function `jit`-compiled by Numba
- The NumPy `np.sum` alternative

In [None]:
%%timeit -n 20 -r 5

# Purely pythonic function, preserved by Numba
# It can be accessed through the .py_func attribute
simple_sum.py_func(input_data)

One important aspect to be aware of is that the JIT compilation occurs during the **first** function call. This results in:
- Higher execution time for the first run
- Significantly faster subsequent executions
- One-time compilation cost that amortizes over multiple calls

This behavior is particularly important to consider when:
1. Benchmarking code performance
2. Working with time-critical applications
3. Running short scripts that only execute the function once

> For accurate performance measurements, always discard the first function call, as it includes the compilation overhead.

In [None]:
# Measure execution time across multiple runs
N_TRIES = 20
measurements = np.zeros(N_TRIES)

# Run the function N_TRIES times and measure each execution
for i in range(N_TRIES):
   start = time.perf_counter()
   simple_sum(input_data)
   stop = time.perf_counter()
   measurements[i] = stop - start

# Convert to microseconds for better readability
measurements = measurements * 1e6

# Print statistics including first run (with compilation)
print("All iterations (including compilation overhead):")
print(f"Average: {np.mean(measurements):.2f} µs")
print(f"Minimum: {np.min(measurements):.2f} µs")
print(f"Maximum: {np.max(measurements):.2f} µs")

print("\nExcluding first iteration (pure execution time):")
print(f"Average: {np.mean(measurements[1:]):.2f} µs")
print(f"Minimum: {np.min(measurements[1:]):.2f} µs")
print(f"Maximum: {np.max(measurements[1:]):.2f} µs")

In [None]:
%%timeit -n 20 -r 5

# Measure jitted function performance using IPython's timeit magic
simple_sum(input_data)

In [None]:
%%timeit -n 20 -r 5

# Compare against NumPy's optimized sum function
np.sum(input_data)

This simple benchmark reveals significant performance differences:

1. **NumPy's sum function**:
  - Hundreds of times faster than pure Python
  - Demonstrates the power of vectorized operations
  - Optimized `C` implementation under the hood

2. **Numba JIT-compiled function**:
  - 50-100x speedup compared to pure Python
  - Achieves performance comparable to NumPy for this simple task
  - Maintains readable Python syntax while delivering near-`C` speed

This comparison highlights two key strategies for high-performance Python code:
- Using vectorized operations (NumPy approach)
- JIT compilation of pure Python code (Numba approach)

The advantage of Numba is that it allows you to write clear, straightforward Python code while still achieving performance similar to optimized `C`-level implementations like NumPy's.


### Types

One of the key advantages Numba offers over plain Python is its ability to infer data types and JIT-compile functions for the specific types being used. This results in more efficient code execution.

For example, consider the `simple_sum` function we have applied so far to a NumPy array of floats.


In [None]:
# Check the type of the first element in the input_data array
type(input_data[0])

In [None]:
# Inspect the types inferred and used by Numba for the simple_sum function
simple_sum.inspect_types()

If we call the function with a different input type, Numba compiles an additional specialization for the new type and keeps the existing one, so each input type is compiled only once.

This ensures optimal performance for various data types!

In [None]:
# Generate an array of random integers scaled by 10
input_data_int = 10 * np.random.randint(low=0, high=100, size=100_000)

# Check the type of the first element in the new input_data array
type(input_data_int[0])

In [None]:
%%timeit -n 20 -r 5

# Measure the jitted function over the new integer data array
simple_sum(input_data_int)

In [None]:
# Inspect once again the types inferred and used by Numba
simple_sum.inspect_types()

We observe similar speedups with other functions that use `for` loops.

For instance, consider the cumulative sum function, which takes an array as input and returns an array as output, rather than a single scalar.

Here, we further take advantage of Numba's ability to efficiently interpret and handle the data types of both input and output arrays.

In [None]:
# Compute the cumulative sum of all elements in an array using Numba
@jit
def simple_cumsum(data):
    # Initialize the output array with the same shape as the input
    out = np.zeros_like(data)
    
    # Variable to store the running sum
    s = 0
    
    # Loop through the input array to compute the cumulative sum
    for n in range(len(data)):
        s += data[n]
        out[n] = s
    
    return out

In [None]:
%%timeit -n 20 -r 5

# Measure the purely pythonic function
simple_cumsum.py_func(input_data)

In [None]:
%%timeit -n 20 -r 5

# Measure the jitted function
simple_cumsum(input_data)

In [None]:
%%timeit -n 20 -r 5

# Measure the NumPy's optimized function
np.cumsum(input_data)

In this case, the JIT-compiled version of the cumulative sum outperforms NumPy's `cumsum` function by a factor of about two.

The NumPy function `cumsum` is more versatile than the JIT function, so the comparison is not entirely fair, but it is remarkable that we can reach performance that is comparable to compiled code by JIT compiling Python code with a single function decorator. 

This allows us to use loop-based computations in Python without performance degradation, which is particularly useful for algorithms that are not easily written in vectorized form.

### `nopython` Mode - A More Complex Example

Let's explore a more complex example: evaluating a ***Julia fractal***. 

This requires a variable number of iterations for each point in a matrix of coordinates in the complex plane. A point $z$ in the complex plane belongs to the Julia set if, after repeatedly applying the iteration formula $z \leftarrow z^2 + c$, it does not diverge after a large number of iterations.

To generate a Julia fractal graph, we loop over a grid of coordinate points, apply the iteration $z \leftarrow z^2 + c$, and store the number of iterations required for the value to diverge beyond a given bound (in this case, an absolute value larger than $2.0$).
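As a rough illustration of the iteration described above, the following pure-Python sketch counts the iterations for a single point. The helper name `escape_time` and its default arguments are assumptions made for this example; unlike the grid function in the next cell, it returns `max_iter` for points that never cross the bound.

```python
def escape_time(z, c=-0.05 + 0.68j, max_iter=256, bound=2.0):
    """Return the iteration index at which |z| first exceeds `bound`.

    Hypothetical helper for illustration; returns `max_iter` if the
    point stays bounded for all iterations.
    """
    for t in range(max_iter):
        z = z * z + c          # the update z <- z^2 + c
        if abs(z) > bound:     # divergence test
            return t
    return max_iter

far_point = escape_time(2 + 2j)   # far from the origin: diverges immediately
origin = escape_time(0j)          # count depends on the constant c
print(far_point, origin)
```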

In [None]:
# Use the `nopython` JIT mode
# Equivalent to `@jit(nopython=True)`
@njit
def julia_fractal(z_re, z_im, j_val):
    # Loop over the real and imaginary components of the grid
    for m in range(len(z_re)):
        for n in range(len(z_im)):
            # Initialize the complex number z with real and imaginary parts
            z = z_re[m] + 1j * z_im[n]
            
            # Perform up to 256 iterations to check divergence
            for t in range(256):
                z = z ** 2 - 0.05 + 0.68j  # Update z using the Julia set formula
                if np.abs(z) > 2.0:  # Check for divergence
                    j_val[m, n] = t  # Store the number of iterations before divergence
                    break

This implementation is simple and straightforward due to the use of explicit loops, but in pure Python, these three nested loops would be prohibitively slow.

However, by leveraging JIT compilation with Numba, we can achieve a significant speed-up.

In Numba versions before 0.59, a plain `@jit` silently fell back to the much slower _object mode_ when it could not compile a function to fully typed machine code. Passing `nopython=True` to `numba.jit` disables that fallback: compilation fails with an error if Numba cannot infer a type for every variable. Since Numba 0.59, `nopython=True` is the default for `@jit`. The `@njit` decorator is shorthand for `@jit(nopython=True)` and behaves the same way in every version.

When automatic type inference fails, the resulting JIT-compiled code usually offers little to no speed-up. Therefore, it's often advisable to use `nopython=True` (or `@njit`) to ensure that we detect failures early when Numba is unlikely to produce optimized code.

In [None]:
# Set up grid size and initialize the output array for iterations
N = 1024 // 2
j = np.zeros((N, N), np.int64)

# Create linearly spaced values for the real and imaginary components of the grid
z_real = np.linspace(-1.5, 1.5, N)
z_imag = np.linspace(-1.5, 1.5, N)

In [None]:
%%timeit -n 1 -r 1

# Measure the purely pythonic function (a single run only: this is slow!)
julia_fractal.py_func(z_real, z_imag, j)

In [None]:
# Re-initialize the grid
j = np.zeros((N, N), np.int64)

In [None]:
%%timeit -n 5 -r 5

# Measure the jitted function
julia_fractal(z_real, z_imag, j)

In [None]:
# Plot the Julia fractal result
fig, ax = plt.subplots(figsize=(12, 12))
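# julia_fractal fills j[m, n] with m indexing Re(z) and n indexing Im(z),
# so transpose to put Re(z) on the x-axis and Im(z) on the y-axis
ax.imshow(j.T, cmap=plt.cm.RdBu_r, extent=[-1.5, 1.5, -1.5, 1.5], origin="lower")
ax.set_xlabel(r"$\mathrm{Re}(z)$", fontsize=18)
ax.set_ylabel(r"$\mathrm{Im}(z)$", fontsize=18)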

This speedup is quite remarkable! It's also interesting to note that a fairly-optimized NumPy implementation does not outperform the loop-heavy Numba version. 

This demonstrates the power of JIT compilation in handling complex, loop-based algorithms efficiently, where traditional array-based operations might not always be faster.

In [None]:
# NumPy based Julia set evaluation
def julia_fractal_np(z_re, z_im, j):
    # Create a meshgrid of complex numbers from the real and imaginary components
    Z_re, Z_im = np.meshgrid(z_re, z_im)
    Z = Z_re + 1j * Z_im
    
    # Define the constant for the Julia set formula
    C = -0.05 + 0.68j
    
    # Initialize a mask to track points that haven't diverged yet
    mask = np.ones(Z.shape, dtype=bool)

    # Iterate up to 256 times to check for divergence
    for t in range(256):
        Z[mask] = Z[mask] ** 2 + C  # Update Z only for points that haven't diverged
        mask = np.abs(Z) <= 2.0  # Update the mask to track non-diverged points
        j[mask] = t  # Store the number of iterations before divergence

In [None]:
# Re-initialize the grid
j = np.zeros((N, N), np.int64)

In [None]:
%%timeit -n 3 -r 3

# Measure the NumPy-based implementation
julia_fractal_np(z_real, z_imag, j)

### Lazy vs. Eager Compilation

The standard usage of the `@jit` and `@njit` decorators results in _lazy_ compilation.

_Lazy_ compilation means that the function will be compiled only when it is first called. At that point, Numba will infer the argument types based on the actual inputs and generate optimized code accordingly.

To perform _eager_ compilation, we need to explicitly tell Numba the function signature in advance, specifying the expected types.

Returning to the `simple_sum` function, if we expect it to work on a specific data type, we can inform the compiler ahead of time and allow it to perform type-specific compilation.

For example, if we want the function to handle single-precision floats, we can specify this using `float32` or its shorthand `f4`.

| Type name(s) | Shorthand | Comments |
|---|---|---|
| `boolean` | `b1` | represented as a byte |
| `uint8`, `byte` | `u1` | 8-bit unsigned byte |
| `uint16` | `u2` | 16-bit unsigned integer |
| `uint32` | `u4` | 32-bit unsigned integer |
| `uint64` | `u8` | 64-bit unsigned integer |
| `int8`, `char` | `i1` | 8-bit signed byte |
| `int16` | `i2` | 16-bit signed integer |
| `int32` | `i4` | 32-bit signed integer |
| `int64` | `i8` | 64-bit signed integer |
| `float32` | `f4` | single-precision floating-point number |
| `float64`, `double` | `f8` | double-precision floating-point number |
| `complex64` | `c8` | single-precision complex number |
| `complex128` | `c16` | double-precision complex number |


When declaring a function for eager JIT compilation, both the type and the structure of the input and output data must be specified:

- `type` represents a scalar.
- `type[:]` represents a vector (or 1D array).

For example, the signature `int32(int32[:])` denotes a function that takes a vector of integers as input and returns a scalar integer as the output.

In [None]:
from numba import f4

# Function returning a float32 and taking an array of float32 as input
# This is an alternative to --> @njit("float32(float32[:])")
@njit("f4(f4[:])")
def simple_sum_float32(data):
    s = 0  # Initialize sum
    for d in data:
        s += d  # Accumulate sum
    return s  # Return the result as a float32

In [None]:
# Generate an array of random floats scaled by 10 and converted to float32
input_data_f32 = 10. * np.random.random(100_000).astype('float32')

# Check the data type of the first element in the input_data_f32 array
input_data_f32[0].dtype

In [None]:
%%timeit -n 20 -r 5

# Measure the jitted function
simple_sum_float32(input_data_f32)

### Other Interesting Compiling Options and Usages

The `@jit` decorator accepts several options. The most useful ones are:

- **`nopython`**: 

This option forces Numba to compile to machine code without relying on the Python `C` API. This yields faster code, but it requires that the types of all inputs, outputs, and local variables can be inferred. If they cannot (for example, if you use a Python data type that cannot be resolved at compile time), compilation fails with an error. In Numba versions before 0.59, a plain `@jit` without this option instead fell back to the so-called `object mode`, which is significantly slower; from 0.59 on, `nopython=True` is the default. Note that `@njit` is simply an alias for `@jit(nopython=True)`.

- **`cache`**: 

This option instructs Numba to save the results of function compilation into a **file-based cache**. When you use a function that has been previously compiled with Numba, it retrieves the compiled version from the cache instead of recompiling. This is especially useful in larger projects where multiple files or notebooks are involved.

- **`parallel`**: 

This option enables Numba to compile a version of the function that will run in parallel across multiple native threads (without the Global Interpreter Lock!). Numba supports _explicit parallel loop declaration_, allowing for embarrassingly parallel computation across loop iterations. You can specify this using `numba.prange` instead of `range` in your `for` loops. 

However, be cautious: using parallel execution does not guarantee a speedup, as it involves threading and may incur overhead from scheduling tasks across processing units. If the input data is small enough, the scheduling overhead might negate any performance gains from threading.

### Example - Gravitational Interaction

Given $N = 1000$ masses scattered in 3D space, we will compute the gravitational force exerted on each mass. 

In this example, we will create a function to calculate the gravitational force on each mass based on its interactions with all other masses in the 3D space.


In [None]:
from numba import prange

# Gravitational constant
G = 6.67430e-11 

# Force Numba to parallelize - remember to use `prange` for the parallel loop
@njit(parallel=True)
def compute_gravitational_forces(masses, positions):
    # Number of particles
    n = masses.shape[0]
    
    # Initialize forces in 3D space (x, y, z)
    forces = np.zeros((n, 3))  

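    # Loop over each pair of particles; only the outer loop is parallelized
    # (Numba runs a `prange` nested inside another `prange` as a plain `range`)
    for i in prange(n):
        for j in range(n):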
            if i != j:  # Ensure we don't compute force on itself
                r_vec = positions[j] - positions[i]  # Vector from i to j
                r_mag = np.sqrt(np.sum(r_vec ** 2))  # Magnitude of the distance vector
                
                # Calculate the magnitude of the gravitational force
                force_mag = G * masses[i] * masses[j] / r_mag ** 2
                
                # Update the force vector for particle i in the direction of j
                forces[i] += force_mag * r_vec / r_mag  

    return forces

In [None]:
# Number of particles
n_particles = 1_000

# Generate random masses between 1e16 and 1e32
masses = np.random.uniform(1e16, 1e32, n_particles)

# Generate random positions in 3D space between -1e11 and 1e11
positions = np.random.uniform(-1e11, 1e11, (n_particles, 3))

In [None]:
# Plotting the particles in 3D
fig = plt.figure(figsize=(16, 16))
ax = fig.add_subplot(111, projection='3d')

# Plot particles with size proportional to their masses
ax.scatter(positions[:, 0], positions[:, 1], positions[:, 2], 
           color='blue', s=100 * masses / max(masses))  # Scale marker size by mass

# Set labels for the axes
ax.set_xlabel('X Position (m)')
ax.set_ylabel('Y Position (m)')
ax.set_zlabel('Z Position (m)')

# Show the plot
plt.show()

In [None]:
%%time

# Measure the jitted function
forces = compute_gravitational_forces(masses, positions)

In [None]:
%%time

# If you have time... measure the time of the non-jitted function
# forces = compute_gravitational_forces.py_func(masses, positions)

In [None]:
from matplotlib import cm

# Plotting the particles and their corresponding gravitational force vectors
fig = plt.figure(figsize=(16, 16))
ax = fig.add_subplot(111, projection='3d')

# Plot particles with size proportional to their masses
ax.scatter(positions[:, 0], positions[:, 1], positions[:, 2], 
           color='blue', s=200 * masses / max(masses))

# Compute force norms for color mapping
force_norms = np.linalg.norm(forces, axis=1)
normed_forces = (force_norms - force_norms.min()) / (force_norms.max() - force_norms.min())
cmap = plt.get_cmap('hot')  # `cm.get_cmap` was removed in Matplotlib 3.9

# Plot force vectors for each particle
for i in range(n_particles):
    ax.quiver(
        positions[i, 0], positions[i, 1], positions[i, 2], 
        forces[i, 0], forces[i, 1], forces[i, 2], 
        color=cmap(normed_forces[i]),  # Color based on the normalized force magnitude
        length=2e10,  # Scale length of vectors for better visualization
        normalize=True  # Normalize vectors for consistent representation
    )

# Set labels for the axes
ax.set_xlabel('X Position (m)')
ax.set_ylabel('Y Position (m)')
ax.set_zlabel('Z Position (m)')

# Show the plot
plt.show()

## @vectorize

### The Basics

In NumPy, universal functions, or `ufunc`s, are functions that operate on ndarrays in an element-wise fashion. Examples include the addition of two ndarrays or the multiplication of an ndarray with a scalar.

Under the hood, NumPy implements these `ufunc`s in pure `C`, allowing them to efficiently process each element of the ndarray, potentially in a vectorized manner.

Writing a custom `ufunc` in NumPy can be quite complex, as it requires a deeper understanding of the underlying implementation than what typical users are accustomed to.

Numba simplifies this process by allowing you to create "vector" versions of Python functions that operate on all scalar elements of a NumPy array with the same speed as `ufunc`s.

Using the `@vectorize` decorator, Numba can compile a pure Python function into a `ufunc` that operates over NumPy arrays as efficiently as traditional `ufunc`s written in C. The `@vectorize` decorator is especially useful for creating ufuncs that are not merely combinations of existing NumPy operations.

The same considerations regarding _lazy_ vs _eager_ compilation apply in this context as well.

Let's start by vectorizing a trivial `simple_add` function using Numba's `@vectorize` decorator. We will eagerly compile it to handle both integers (32 and 64 bits) and floats (32 and 64 bits).

The following implementation will demonstrate how to define a function that adds two numbers and can be applied element-wise to NumPy arrays.

In [None]:
from numba import vectorize, int32, int64, float32, float64

# Vectorize the addition of two values
# Specialize this function to add either integers (32 and 64 bits)
# or floats (single or double precision)
@vectorize(['int32(int32, int32)',
            'int64(int64, int64)',
            'float32(float32, float32)',
            'float64(float64, float64)'])
def simple_add(x, y):
    return x + y

In [None]:
# Generate two random arrays of floats
a = 10 * np.random.random(100_000)
b = 20 * np.random.random(100_000)

In [None]:
# Use the vectorized simple_add function to add arrays a and b
c = simple_add(a, b)

# Display the result
c

#### Differences Between `@vectorize` and `@jit`

One common question is whether there is any difference between using `@vectorize` and writing a `@jit` function that operates on all elements of the ndarrays.

There is a very interesting and important distinction that arises when using `@vectorize`: all NumPy `ufunc`s automatically support additional features such as **reduction**, **accumulation**, and **broadcasting**.

Using the example above, let's explore how these features enhance the functionality of the `@vectorize` approach.

In [None]:
# Using broadcasting: Adding a scalar to the array b
a_1 = 1  # Scalar value
c = simple_add(a_1, b)  # The scalar a_1 will be broadcasted to each element of b

# Display the result
c

In [None]:
# The ufunc also works on multi-dimensional arrays (same shapes here, so no broadcasting is needed)
a_2 = a.reshape(100, 1_000)  # Reshape array a into a 2D array of shape (100, 1,000)
b_2 = b.reshape(100, 1_000)  # Reshape array b into a 2D array of shape (100, 1,000)

# Perform element-wise addition using the vectorized simple_add function
c = simple_add(a_2, b_2)

# Display the result
c

In [None]:
# Performing reduction using the simple_add function
# This sums the elements of array a along the specified axis (0 in this case)
result_reduction = simple_add.reduce(a, axis=0)

# Display the result of the reduction
result_reduction

The `vectorize` decorator in Numba supports multiple targets for executing the vectorized function:

- **`cpu`**: Runs the vectorized function in a single-threaded manner on the CPU.

- **`parallel`**: Executes the vectorized function in a multi-threaded manner across all available CPUs (without the Global Interpreter Lock, or GIL). It is important to note that this approach involves threading, which can introduce a small overhead due to the scheduling of tasks across multiple threads. This option is beneficial when the dataset is large enough to "hide" the latency associated with the initial scheduling.

- **`cuda`**: Enables the vectorized function to run on a GPU as a dedicated CUDA kernel, allowing for significant performance improvements in suitable scenarios.

### Example - Invert Image

Given an RGB image, invert its colors using Numba's `@vectorize` decorator. 
The idea is to create a function that takes an RGB image as input and returns a new image with inverted colors. This is achieved by subtracting each color channel's values from the maximum value, which is typically 255 for an 8-bit image.


In [None]:
# Define a vectorized function to invert the colors of an image
@vectorize(['uint8(uint8)'], target='parallel')
def invert_color(pixel):
    return 255 - pixel  # Invert the color by subtracting from 255

In [None]:
# Load a sample image using Matplotlib
image = plt.imread('sample_image.png')  # Read the image file
image.shape  # Display the shape of the image

In [None]:
# Check the data type of the image array
image.dtype  # This will show the data type of the image

In [None]:
# Convert the image to uint8 format (values between 0 and 255)
if image.dtype != np.uint8:
    image = (image * 255).astype(np.uint8)  # Scale and convert to uint8 if necessary

In [None]:
%%time

# Apply the vectorized function to invert the colors of the image
inverted_image = invert_color(image)  # This will run the color inversion in parallel

In [None]:
# Plot the original and inverted images side by side
fig, ax = plt.subplots(1, 2, figsize=(20, 10))

ax[0].imshow(image)  # Display the original image
ax[0].set_title('Original Image')  
ax[0].axis('off')  

ax[1].imshow(inverted_image)  # Display the inverted image
ax[1].set_title('Inverted Image')  
ax[1].axis('off')  

plt.show()  

### Extending it to `@guvectorize`

The `@vectorize` decorator enables the definition and compilation of element-wise functions for faster execution. To work with ndarrays more flexibly, Numba offers the `@guvectorize` decorator, which allows you to write universal functions (ufuncs) that can operate on an arbitrary number of elements of input arrays and return arrays of potentially differing dimensions.

One key point to note about `@guvectorize` functions is that, unlike `@vectorize` functions, `guvectorize` functions do not return their result values. Instead, they function similarly to "void" functions: they take the result/output values as input parameters and fill them. This behavior is akin to kernels in CUDA.

This design choice is because the array is allocated by NumPy’s dispatch mechanism, which calls into the Numba-generated code to perform the operations.

A very simple example that illustrates this logic is adding a scalar value to every element of an array:

In [None]:
# Define a guvectorized function to add a scalar to each element of an input array
@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)', target='parallel')
def gu_vector_add(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y  # Add the scalar y to each element of array x
    # No return statement: the output is written into the provided `res` array

The declaration of the input and output layouts is usually reported in symbolic form: `(n),()->(n)`. 

- `()` represents a scalar (in this case, of type `int64`)
- `(n)`, `(m)`, etc. represent 1D arrays (e.g., `int64[:]`)
- `(n,n)`, `(n,m)`, etc. represent 2D arrays (e.g., `int64[:, :]`)
- ...

The previous definition instructs NumPy that the function takes a one-dimensional array of n elements `(n)`, a scalar `()`, and returns a one-dimensional array of the same number of elements as the input array `(n)`.

As always in Numba, you can pass a list of supported concrete signatures to specify for which types Numba should compile the function.

In [None]:
# Define a scalar value
sca = 10

# Create a random integer array with values in [0, 100), of size 100,000
# (explicit int64 so the dtype matches the compiled signature on every platform)
arr = np.random.randint(low=0, high=100, size=100_000, dtype=np.int64)

In [None]:
%%time

# Apply the guvectorized function to add the scalar to the array
result = gu_vector_add(arr, sca)

### Example - 1D Denoising

We are given a 1D signal composed of a single frequency plus additive Gaussian noise.

We will apply a moving average filter to denoise the signal. 

The moving average filter will help smooth out the noise by averaging the values within a specified window size.
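As a point of reference, a moving average can be sketched with plain NumPy before writing a Numba kernel. The helper `moving_average` and the window size of 3 are assumptions made for this example; `mode="same"` keeps the output length equal to the input and treats values beyond the edges as zero.

```python
import numpy as np

def moving_average(x, window):
    """Hypothetical helper: average each sample with its neighbours."""
    kernel = np.ones(window) / window          # uniform weights summing to 1
    return np.convolve(x, kernel, mode="same")

flat = np.ones(10)                 # a constant signal
smoothed = moving_average(flat, 3)

print(smoothed)
```

Interior samples of a constant signal are left unchanged, while the first and last samples are pulled toward zero because part of the window falls outside the signal.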

In [None]:
# Generate the clean signal (a sine wave); noise is added in the next cell
np.random.seed(123)  # For reproducibility
t = np.linspace(0, 4 * np.pi, 500)  # Time values
signal = np.sin(t) # Sine wave

# Visualization
plt.figure(figsize=(12, 3))

# Plot the original signal
plt.plot(t, signal, color='green', label='Original Signal')
plt.title('Original Signal')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# Add some Gaussian noise
noise = np.random.normal(scale=0.5, size=t.shape)
noisy_signal = signal + noise

# Visualization
plt.figure(figsize=(12, 3))

# Plot the noisy signal
plt.plot(t, signal, color='green', label='Original Signal')
plt.plot(t, noisy_signal, color='grey', label='Noisy Signal')
plt.title('Noisy Signal')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# Define a simple moving average kernel
kernel_size = 20
kernel = np.ones(kernel_size) / kernel_size  # Average kernel --> [1/20, 1/20, 1/20, ...]

# Visualization
plt.figure(figsize=(12, 3))

# Plot the noisy signal with an instance of the kernel superimposed
plt.plot(t, signal, color='green', label='Original Signal')
plt.plot(t, noisy_signal, color='grey', label='Noisy Signal')
plt.plot(t[:kernel_size], kernel, color='red', label='Kernel')
plt.title('Kernel')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# Define the convolution function using guvectorize
@guvectorize([(float64[:], float64[:], float64[:])], '(n),(k)->(n)', target='parallel')
def convolve_1d(signal, kernel, result):
    n = signal.shape[0]  # Length of the input signal
    k = kernel.shape[0]  # Length of the convolution kernel
    half_k = k // 2  # Half the kernel size for centering

    for i in range(n):
        tmp = 0.0  # Temporary variable to hold the convolution result
        for j in range(k):
            # Check bounds to prevent index errors
            if 0 <= i - half_k + j < n:
                tmp += signal[i - half_k + j] * kernel[j]  # Perform the convolution
        result[i] = tmp  # Store the result

In [None]:
# Output array to hold the convolution result
smoothed_signal = np.empty_like(noisy_signal)

# Perform the convolution
convolve_1d(noisy_signal, kernel, smoothed_signal)

# Plot the final result
plt.figure(figsize=(12, 3))

plt.plot(t, signal, color='green', label='Original Signal')
plt.plot(t, noisy_signal, color='grey', label='Noisy Signal')
plt.plot(t, smoothed_signal, color='blue', label='Smoothed Signal')
plt.title('Smoothed Signal with Moving Average')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.legend()

plt.tight_layout()
plt.show()
