# Numpy vs Python Lists - Performance Test

In [1]:
import numpy as np 
import time

## Python Zip Example

In [2]:
# Python Zip Explained
l1 = [1, 2, 4]
l2 = [6, 7, 8]
list(zip(l1,l2))

[(1, 6), (2, 7), (4, 8)]

## Evaluating speed with a Python list

In [3]:
# Using Python List
size = 1_000_000

l1 = list(range(size))
l2 = list(range(size))

start = time.time()

add = [x+y for x,y in zip(l1, l2)]
end = time.time()
print(add[0:10])

print(f"Time taken: {end - start:.2f} seconds")

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Time taken: 0.07 seconds


## Evaluating speed with a NumPy array

In [4]:
# Using Numpy Array
size = 1_000_000

l1 = np.array(list(range(size)))
l2 = np.array(list(range(size)))

start = time.time()

add = l1 + l2
end = time.time()  # stop the clock before printing so only the addition is timed
print(add[0:10])

print(f"Time taken: {end - start:.2f} seconds") 

[ 0  2  4  6  8 10 12 14 16 18]
Time taken: 0.02 seconds
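
A single `time.time()` reading is sensitive to whatever else the machine is doing at that moment. As a hedged alternative, the sketch below repeats each addition with the standard `timeit` module and keeps the best run. The names `add_lists`, `add_arrays`, `list_a`, `arr_a` and the smaller `size` are illustrative choices, not part of the cells above.

```python
import timeit
import numpy as np

size = 100_000  # assumption: smaller than the 1_000_000 above so repeated runs stay quick
list_a = list(range(size))
list_b = list(range(size))
arr_a = np.arange(size)
arr_b = np.arange(size)

def add_lists():
    # Element-by-element addition in the Python interpreter
    return [x + y for x, y in zip(list_a, list_b)]

def add_arrays():
    # One vectorized addition over the whole array
    return arr_a + arr_b

# Best of 5 repeats of 10 calls each; the minimum is least disturbed by other processes
list_time = min(timeit.repeat(add_lists, number=10, repeat=5)) / 10
array_time = min(timeit.repeat(add_arrays, number=10, repeat=5)) / 10

print(f"List comprehension: {list_time * 1e3:.3f} ms per call")
print(f"NumPy addition:     {array_time * 1e3:.3f} ms per call")
```

The exact numbers depend on the machine and the NumPy build, so none are quoted here.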


### Creating a one-dimensional array (vector)

In [5]:
np.array([1,2,3,4,5])

array([1, 2, 3, 4, 5])

### Creating a two-dimensional array (2D matrix) - Example

In [6]:
np.array([[1,2,3], [4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

### Creating a 3x3 two-dimensional array (square matrix) - Example

In [7]:
arr = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(f"Object Type: {type(arr)}")
print(f"{arr}")
print(f"Shape of Matrix: {arr.shape}")

Object Type: <class 'numpy.ndarray'>
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Shape of Matrix: (3, 3)
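
The number of dimensions comes from how deeply the input lists are nested, not from how many rows there are. A small sketch with made-up arrays (`vector`, `matrix` and `cube` are illustrative names) shows how `ndim` and `shape` report this:

```python
import numpy as np

vector = np.array([1, 2, 3])                   # one level of nesting  -> 1 axis
matrix = np.array([[1, 2, 3], [4, 5, 6]])      # two levels of nesting -> 2 axes
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])            # three levels of nesting -> 3 axes

for name, a in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: ndim={a.ndim}, shape={a.shape}")

# vector: ndim=1, shape=(3,)
# matrix: ndim=2, shape=(2, 3)
# cube: ndim=3, shape=(2, 2, 2)
```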


## Memory Efficiency - Numpy vs Lists

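Let's check memory consumption. Note that `sys.getsizeof` on a list reports only the list object itself (its header plus one pointer per element), not the integer objects those pointers refer to, so it must not be multiplied by the length.

In [8]:
import sys
list_data = list(range(1000))
numpy_data = np.array(list_data)
# getsizeof(list) covers the list header and its pointer array, not the int objects
print("Python list size (pointers only):", sys.getsizeof(list_data), "Bytes")
print("Numpy array size:", numpy_data.nbytes, "Bytes")

Python list size (pointers only): 8056 Bytes
Numpy array size: 8000 Bytes


The list's pointer array alone is already as large as the entire NumPy array, and each element is additionally a separate Python int object (about 28 bytes each on 64-bit CPython), so the list's real footprint is several times larger. NumPy stores the raw 8-byte values contiguously with no per-element object overhead, which can matter a lot in large computations.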
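
A fuller estimate of the list's footprint adds the size of each element object to the size of the list container. The sketch below is an approximation under stated assumptions: it counts every integer once, although CPython may share some small integer objects with other code, and the names `container_bytes`, `element_bytes` and `list_total` are illustrative.

```python
import sys
import numpy as np

list_data = list(range(1000))
numpy_data = np.array(list_data)

# The list object itself: header plus one pointer per element
container_bytes = sys.getsizeof(list_data)
# The int objects the pointers refer to (approximation: each counted once)
element_bytes = sum(sys.getsizeof(x) for x in list_data)
list_total = container_bytes + element_bytes

print("List container:      ", container_bytes, "Bytes")
print("List elements (ints):", element_bytes, "Bytes")
print("List total (approx.):", list_total, "Bytes")
print("NumPy data buffer:   ", numpy_data.nbytes, "Bytes",
      f"({numpy_data.itemsize} bytes per element)")
print(f"Ratio: about {list_total / numpy_data.nbytes:.1f}x")
```

Exact byte counts vary with the Python version and platform, so no output is quoted.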

## What is Vectorization in Numpy?

Vectorization in NumPy refers to the process of performing mathematical operations on entire arrays (or vectors/matrices) at once, rather than iterating through each element individually using loops. This is a core concept in NumPy that makes it incredibly efficient for data science tasks like data manipulation, statistical analysis, and machine learning. 

NumPy applies operations element-wise across arrays in a single, optimized step that runs in low-level compiled code (mostly C) behind the scenes. This avoids the overhead of Python's interpreter loop, leading to significant performance gains.
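
To make the contrast concrete, here is a minimal sketch that computes the same result twice, once with an explicit Python loop and once with a single vectorized expression. The `prices` and `quantities` arrays are made-up illustration data.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])      # hypothetical data
quantities = np.array([1.0, 2.0, 3.0])     # hypothetical data

# Loop version: the interpreter handles one element per iteration
totals_loop = np.empty_like(prices)
for i in range(len(prices)):
    totals_loop[i] = prices[i] * quantities[i]

# Vectorized version: one expression, the loop runs inside NumPy's compiled code
totals_vec = prices * quantities

print(totals_vec)                          # [10. 40. 90.]
print(np.array_equal(totals_loop, totals_vec))  # True
```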

SIMD (Single Instruction, Multiple Data) is a key hardware-level feature in modern CPUs (and GPUs) that enables parallel processing of data. It's one of the main reasons why NumPy's vectorized operations are so fast, as NumPy is designed to exploit SIMD capabilities under the hood.

### Example
Imagine you're adding two lists of numbers. In scalar (non-SIMD) mode, the CPU adds one pair at a time: 1+5, then 2+6, and so on. With SIMD, it loads 4 (or more) values from each list into wide registers, one value per "lane", and adds all the pairs with a single instruction: (1+5, 2+6, 3+7, 4+8). This works because the CPU has vector registers (for example 128 or 256 bits wide) that hold multiple values at once; a 128-bit register holds four 32-bit floats.
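
The arithmetic in this example can be mirrored in NumPy. This is only an illustration: NumPy does not expose SIMD lanes directly, and whether a given operation actually uses SIMD instructions depends on the CPU and on how NumPy was built. The lane count is simply the register width divided by the element width.

```python
import numpy as np

# The four pairs from the example, stored as 32-bit floats
a = np.array([1, 2, 3, 4], dtype=np.float32)
b = np.array([5, 6, 7, 8], dtype=np.float32)

# One vectorized addition produces all four sums: 6, 8, 10, 12
sums = a + b
print(sums)

# How many 32-bit values fit in one vector register of a given width
element_bits = a.itemsize * 8  # itemsize is in bytes, so 4 * 8 = 32 bits
for register_bits in (128, 256):
    lanes = register_bits // element_bits
    print(f"{register_bits}-bit register: {lanes} lanes of {element_bits}-bit floats")
```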

In [9]:
#Vectorization Comparison between Python and Numpy

size1 = 1_000
list1 = list(range(size1))
arr1 = np.array(list1)

# Python List (Loop-based)
start1 = time.time()
list_square = [x ** 2 for x in list1]
end1 = time.time()
#print(f"Python Loop result: {list_square}")
print(f"Time taken: {end1 - start1} seconds") 

#Numpy (vectorized)
start2 = time.time()
numpy_square = arr1 ** 2
end2 = time.time()
#print(f"Numpy Array result: {numpy_square}")
print(f"Time taken: {end2 - start2} seconds") 

Time taken: 0.0 seconds
Time taken: 0.0 seconds


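Both timings print as 0.0 because squaring 1,000 elements finishes faster than `time.time()` can resolve here, so this run does not show the gap; a larger `size1` or a higher-resolution timer such as `time.perf_counter()` is needed to measure it. On larger inputs NumPy is typically much faster because its operations run in precompiled C loops that can use CPU SIMD features such as AVX (Advanced Vector Extensions), processing multiple data points per instruction without the overhead of the Python interpreter's loop mechanics.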
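
With only 1,000 elements, both measurements above are too short for `time.time()` to register. A hedged sketch of a more sensitive comparison follows: it uses `time.perf_counter()`, keeps the best of several runs, and tries a few input sizes. The helper `best_time` and the chosen sizes are illustrative, not part of the original cell.

```python
import time
import numpy as np

def best_time(func, repeats=5):
    """Return the shortest wall-clock time of `repeats` calls to func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

for size in (1_000, 100_000, 1_000_000):
    values = list(range(size))
    array = np.arange(size, dtype=np.int64)  # int64 so the squares cannot overflow

    loop_t = best_time(lambda: [x ** 2 for x in values])
    vec_t = best_time(lambda: array ** 2)

    print(f"size={size:>9,}: list {loop_t * 1e3:9.4f} ms, "
          f"NumPy {vec_t * 1e3:9.4f} ms")
```

Timings are machine-dependent, so none are quoted here; the point is that the difference becomes measurable as `size` grows.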

When working with large datasets, native Python loops can lead to prohibitive execution times. This is why vectorization with libraries like NumPy is central to data science and machine learning workflows: a mathematical operation is expressed once over a whole array and executed in optimized C code, rather than element by element in the interpreter.