1. Explain the difference between NumPy arrays and Python lists. What advantages does NumPy provide in data engineering workflows?

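NumPy arrays are homogeneous: every element has the same fixed dtype and lives in one typed buffer, so arrays are faster and more memory-efficient than lists, which store pointers to separate Python objects.

They support vectorized operations and broadcasting, so most loops become a single array expression.

Element-wise work runs in compiled C loops, and linear algebra is delegated to BLAS/LAPACK libraries.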

In [5]:
import numpy as np
import time

lst = list(range(10_000_000))
arr = np.array(lst)

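# Pure-Python list comprehension: one interpreted multiplication per element
start = time.perf_counter()
lst = [x * 2 for x in lst]
print("List time:", time.perf_counter() - start)

# NumPy vectorized: one call, the loop runs in compiled code
start = time.perf_counter()
arr = arr * 2
print("NumPy time:", time.perf_counter() - start)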


List time: 0.4470326900482178
NumPy time: 0.012510299682617188
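
The memory claim can be checked the same way as the speed claim. The sketch below is a rough estimate, not an exact accounting: it assumes a 64-bit CPython build and counts every int object in the list, even though CPython caches small integers. The names list_bytes and array_bytes are illustrative.

```python
import sys
import numpy as np

n = 1_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers; each int is a separate Python object with its own header.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

# An array stores the raw 8-byte integers back to back in one buffer.
array_bytes = arr.nbytes

print("list bytes (estimate):", list_bytes)
print("array bytes:", array_bytes)
```

The array needs exactly n * itemsize bytes for its data, while the list pays for the pointer table plus every boxed integer.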


2. How does broadcasting work in NumPy? Provide an example where broadcasting simplifies a computation.

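NumPy compares shapes starting from the trailing dimension: two dimensions are compatible when they are equal or one of them is 1. Size-1 (or missing) dimensions are stretched virtually, without copying data, so the operation runs as if both operands had the same shape.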

In [6]:
import numpy as np

a = np.arange(4)           # [0 1 2 3]
b = 2                      # scalar

print(a + b)               # [2 3 4 5]

[2 3 4 5]


In [8]:
# Matrix + vector: b has shape (3,) and is broadcast across both rows of A, shape (2, 3)
A = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([1, 1, 1])
print(A + b)


[[2 3 4]
 [5 6 7]]
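
A typical case where broadcasting removes a loop is centering each column of a feature matrix. The sketch below uses a made-up 4 x 3 matrix X (hypothetical data) and also shows a column vector combined with a row vector to build a full grid.

```python
import numpy as np

# Hypothetical data: 4 samples (rows) x 3 features (columns)
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])

col_means = X.mean(axis=0)   # shape (3,)
centered = X - col_means     # (4, 3) - (3,) broadcasts to (4, 3)

# Column vector (3, 1) combined with row vector (1, 4) gives a (3, 4) grid
rows = np.arange(3).reshape(3, 1)
cols = np.arange(4).reshape(1, 4)
grid = rows * 10 + cols

print(centered.mean(axis=0))  # every column now averages to zero
print(grid.shape)
```

Without broadcasting, the centering step would need either an explicit loop over rows or a tiled copy of col_means.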


3. What are views vs. copies in NumPy? How do slicing operations affect memory usage and performance?

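Basic slicing (start:stop:step) returns a view: no data is copied, so it costs constant memory, and writes through the view modify the original array.

Advanced indexing (integer or boolean arrays) returns a copy, and .copy() forces one explicitly when an independent array is needed.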

In [9]:
arr = np.arange(10)
view = arr[2:6]
view[0] = 99
print(arr)  # → view changed arr

copy_arr = arr[2:6].copy()
copy_arr[0] = 111
print(arr)  # → original unchanged


[ 0  1 99  3  4  5  6  7  8  9]
[ 0  1 99  3  4  5  6  7  8  9]
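
To check whether a given result is a view or a copy, np.shares_memory and the .base attribute can help. The sketch below contrasts a basic slice with integer-array and boolean indexing; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[2:6]             # basic slice: a view onto arr's buffer
fancy = arr[[2, 3, 4, 5]]   # integer-array indexing: a new array
masked = arr[arr > 5]       # boolean indexing: a new array

print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, fancy))   # False
print(np.shares_memory(arr, masked))  # False
print(view.base is arr)               # True: the view points back at arr

fancy[0] = -1
print(arr[2])                         # still 2, the copy is independent
```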


4. How can you handle missing or invalid values in NumPy arrays? What are common patterns or functions available?

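NumPy has no dedicated missing-value type. The usual convention is np.nan, which is a float, so an array that holds it must have a floating-point dtype. np.isnan detects it, and the NaN-aware reductions (np.nanmean, np.nansum, and similar) skip it.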

In [10]:
data = np.array([1, 2, np.nan, 4, np.nan])

print(np.isnan(data))        # detect missing
print(np.nanmean(data))      # mean ignoring NaN
print(np.nan_to_num(data))   # replace NaN with 0

[False False  True False  True]
2.3333333333333335
[1. 2. 0. 4. 0.]
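
Beyond replacing NaN with 0, three other patterns come up often: dropping missing entries with a boolean mask, imputing a statistic with np.where, and wrapping the data in a masked array. A minimal sketch, reusing the same values as above:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
mask = np.isnan(data)

dropped = data[~mask]                             # keep only valid entries
filled = np.where(mask, np.nanmean(data), data)   # impute the mean of valid entries
masked = np.ma.masked_invalid(data)               # masked array hides NaN/inf

print(dropped)
print(filled)
print(masked.mean(), masked.count())
```

Which pattern fits depends on whether downstream code needs the original length preserved (imputation, masked arrays) or can work on a shorter array (dropping).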


5. Explain vectorization in NumPy. Why is it faster than using Python loops?

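Vectorization replaces explicit Python loops with whole-array operations. The loop still happens, but in compiled C over a typed buffer, so there is no per-element interpreter dispatch, type checking, or object boxing.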

In [11]:
arr = np.arange(5)
print(arr ** 2)  # square every element without a loop

[ 0  1  4  9 16]
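
Conditional logic can be vectorized too. The sketch below writes the same rule (double positive values, zero out the rest) as a Python loop and as a single np.where call; the rule and the sample values are made up for illustration.

```python
import numpy as np

values = np.array([3.0, -1.5, 0.0, 7.25, -4.0])

# Loop version: every iteration goes through the interpreter
result_loop = []
for v in values:
    result_loop.append(v * 2 if v > 0 else 0.0)

# Vectorized version: the comparison, multiplication, and selection
# each run as one compiled pass over the array
result_vec = np.where(values > 0, values * 2, 0.0)

print(result_loop)
print(result_vec)
```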


6. What are structured and record arrays? In what scenarios would you use them?

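Structured arrays have a compound dtype with named fields, so each element behaves like one row of a typed table. Record arrays (np.recarray) hold the same data but also expose fields as attributes. Both are useful for table-like data in pure NumPy, for example fixed-layout binary records.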

In [12]:
data = np.array([(1, "Alice", 50.5),
                 (2, "Bob", 75.0)],
                dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])

print(data['name'])   # extract a column


['Alice' 'Bob']
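
Fields can also drive filtering and sorting, and a record-array view adds attribute access. A small sketch with the same dtype as above; the third row ("Cara") is a hypothetical addition so the filter and sort have something to show.

```python
import numpy as np

dtype = [('id', 'i4'), ('name', 'U10'), ('score', 'f4')]
data = np.array([(1, "Alice", 50.5),
                 (2, "Bob", 75.0),
                 (3, "Cara", 62.0)],   # hypothetical extra row
                dtype=dtype)

high = data[data['score'] > 60]           # boolean filter on one field
by_score = np.sort(data, order='score')   # sort rows by a field
rec = data.view(np.recarray)              # record array: fields as attributes

print(high['name'])
print(by_score['name'])
print(rec.score.mean())
```

For anything beyond simple fixed-layout records (joins, group-by, mixed missing data), a Pandas DataFrame is usually the more convenient tool.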


7. Describe how memory layout works in NumPy (C-order vs. Fortran-order). When does it matter?

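C-order (row-major): the elements of a row are adjacent in memory; the last index varies fastest.

Fortran-order (column-major): the elements of a column are adjacent in memory; the first index varies fastest.

It matters for performance when the access pattern matches or fights the layout (row-wise vs. column-wise traversal of large arrays) and when passing arrays to libraries that expect a specific order, as many linear algebra routines do.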

In [13]:
arr_c = np.ones((3,3), order='C')
arr_f = np.ones((3,3), order='F')

print(arr_c.flags)
print(arr_f.flags)


  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
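
The flags only say which layout an array has; the strides show what that means in bytes. The sketch below uses a 3 x 4 float64 array (8 bytes per element) and also shows that a transpose is a view with swapped strides, which np.ascontiguousarray can turn back into a C-ordered copy.

```python
import numpy as np

arr_c = np.zeros((3, 4), dtype=np.float64, order='C')
arr_f = np.zeros((3, 4), dtype=np.float64, order='F')

# Strides: bytes to step to the next element along each axis
print(arr_c.strides)   # (32, 8): next row is 4 * 8 bytes away, next column 8 bytes
print(arr_f.strides)   # (8, 24): next row is 8 bytes away, next column 3 * 8 bytes

# Transposing swaps the strides without copying any data
t = arr_c.T
print(t.flags['C_CONTIGUOUS'], t.flags['F_CONTIGUOUS'])

# Force a C-ordered copy when a routine needs row-major input
fixed = np.ascontiguousarray(t)
print(fixed.flags['C_CONTIGUOUS'])
```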



8. How do you integrate NumPy with other data engineering tools (e.g., Pandas, Spark, Dask)?

In [16]:
#Pandas → NumPy
import pandas as pd
df = pd.DataFrame({"a": [1,2,3]})
arr = df["a"].to_numpy()
arr

array([1, 2, 3], dtype=int64)

9. Explain how NumPy uses underlying BLAS/LAPACK libraries. How does that influence performance for linear algebra tasks?

In [17]:
import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

C = A @ B  # matrix multiplication (BLAS)
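
In typical NumPy builds, A @ B on floating-point matrices is handed to a BLAS routine and np.linalg functions call LAPACK, so the speed depends on which library NumPy was linked against (np.show_config() reports this). The sketch below is only a correctness check under that assumption: naive_matmul is a hypothetical helper written for comparison, not part of NumPy, and the matrices are kept small so the pure-Python triple loop finishes quickly.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple-loop matrix product, for comparison only."""
    n, k = A.shape
    k2, m = B.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

rng = np.random.default_rng(0)
A = rng.random((30, 20))
B = rng.random((20, 10))

C_naive = naive_matmul(A, B)
C_fast = A @ B                       # delegated to the linked BLAS in typical builds
print(np.allclose(C_naive, C_fast))

# Solving a linear system goes through LAPACK; adding 5 * I keeps M well-conditioned
M = rng.random((5, 5)) + 5 * np.eye(5)
b = rng.random(5)
x = np.linalg.solve(M, b)
print(np.allclose(M @ x, b))
```

Both paths give the same result up to floating-point rounding; the difference is that the optimized library uses blocking, SIMD, and often multiple threads, which is why large products like the 1000 x 1000 case above finish quickly.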