# NumPy

## Understanding data storage
Effective data-driven science and computation requires understanding how data is stored and manipulated. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy and Pandas improve upon the base data structures.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a statically typed language like C or Java requires an explicit type declaration for each variable, a dynamically typed language like Python skips this specification. For example, a C loop that sums the integers below 100 must first declare its variables together with their types (for instance, ```int result = 0;```).

In Python, the equivalent operation could be written as follows:

In [1]:
# Python code
result = 0
for i in range(100):
    result += i

Notice the main difference: in C, variable data types are explicitly declared, while in Python they are dynamically inferred from their values. This means that we can assign any kind of data to any variable:

In [7]:
# Python code
x = 4
x = "four"

But this flexibility comes at a cost. In CPython, a single integer is stored as a full object made up of four pieces, each of which adds overhead:
- ```ob_refcnt```, a reference count that helps Python silently handle memory allocation and deallocation
- ```ob_type```, which encodes the variable type
- ```ob_size```, which specifies the size of the following data members
- ```ob_digit```, which contains the actual integer value for the Python variable.
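
The overhead can be observed directly. The following is a minimal sketch, assuming CPython (the byte count reported by `sys.getsizeof` varies with interpreter version and platform); the variable names are illustrative only.

```python
import sys
import numpy as np

# Size in bytes of one Python integer object, including its header fields.
# The exact figure is CPython- and platform-specific.
py_int_bytes = sys.getsizeof(1)

# A NumPy array stores raw fixed-width values with no per-element object header.
arr = np.zeros(1000, dtype=np.int64)
bytes_per_element = arr.itemsize  # 8 bytes for a 64-bit integer

print("Python int object:", py_int_bytes, "bytes")
print("NumPy int64 element:", bytes_per_element, "bytes")
print("NumPy data buffer for 1000 elements:", arr.nbytes, "bytes")
```

The comparison covers only the per-value storage; the array object itself also carries a small fixed header.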

## NumPy arrays
NumPy is a tool for handling numerical data in Python more efficiently. It has a closer connection to C and places stronger restrictions on data types: all elements of an array must share the same (typically numerical) type.

In [6]:
import numpy as np
# integer array:
int_array = np.array([1, 4, 2, 5, 3])
print("Array is:", int_array)
print("Array data type is:", int_array.dtype)

Array is: [1 4 2 5 3]
Array data type is: int32


If the types do not match, NumPy will *up-cast* the array where possible. For example, consider the following array containing a mix of integers and floating-point numbers; NumPy up-casts all of the elements to floats.

In [12]:
mixed_array = np.array([3.14, 4, 2, 3])
print("Array is:", mixed_array)
print("Array data type is:", mixed_array.dtype)

Array is: [3.14 4.   2.   3.  ]
Array data type is: float64
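
The element type can also be requested explicitly rather than inferred. The sketch below uses the `dtype` keyword of `np.array` and the `astype` method; the array names are made up for illustration.

```python
import numpy as np

# Request a specific element type instead of relying on inference
floats32 = np.array([1, 4, 2, 5, 3], dtype=np.float32)

# Mixed integers and floats are up-cast to a common floating-point type
mixed = np.array([3.14, 4, 2, 3])

# astype() returns a converted copy; float -> int conversion drops the fractional part
as_ints = mixed.astype(np.int64)

print(floats32.dtype)  # float32
print(mixed.dtype)     # float64
print(as_ints)         # [3 4 2 3]
```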


Unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:

In [13]:
# Nested lists result in multi-dimensional arrays
# range() generates a sequence of numbers
# Use a list comprehension to succinctly generate a list of lists
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

## Creating NumPy arrays from scratch
It is often more efficient to create arrays from scratch using built-in NumPy routines. Array values can be updated by subsequent code.

In [14]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [15]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [16]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [17]:
# Create an array filled with a linear sequence
# Starting at 0, stopping before 20, stepping by 2
# (this is similar to Python's built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [18]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [19]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.65531042, 0.9865746 , 0.23864363],
       [0.12425026, 0.07339364, 0.74040114],
       [0.37099621, 0.45801464, 0.70536856]])

In [20]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 0.50964334, -1.97406049,  0.81231851],
       [-0.12392937,  1.71716507, -2.12793696],
       [ 0.42057098,  0.49227967,  0.56654384]])

In [21]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[4, 5, 1],
       [4, 5, 6],
       [5, 7, 2]])

In [22]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [23]:
# Create an uninitialized array of three floating-point values (np.empty defaults to float64)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([1., 1., 1.])

## Basic NumPy array attributes

In [31]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

In [32]:
print("x3 ndim: ", x3.ndim) # number of array dimensions
print("x3 shape:", x3.shape) # size of each array dimension
print("x3 size: ", x3.size) # total array size

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60
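
Beyond `ndim`, `shape`, and `size`, arrays also expose their element type and memory footprint. The sketch below rebuilds `x3` the same way as above and inspects `dtype`, `itemsize`, and `nbytes`; the default integer type, and therefore the byte counts, depends on the platform and NumPy version.

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

print("dtype:   ", x3.dtype)     # element type (platform-dependent default integer)
print("itemsize:", x3.itemsize)  # bytes per element
print("nbytes:  ", x3.nbytes)    # total bytes in the data buffer

# nbytes is always the element count times the per-element size
total_matches = x3.nbytes == x3.size * x3.itemsize
print("nbytes == size * itemsize:", total_matches)
```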


## Array indexing

In [33]:
x1[0] # Give me the value at the 0-th array location

5

In [34]:
x1[4] # Give me the value at the 4-th array location

7

In [35]:
x1[-1] # Give me the last array element

9

In a multi-dimensional array, elements can be accessed using a comma-separated tuple of indices.

In [36]:
x2[0,0] # Give me the value at array location [0,0]

3

Elements can also be updated using the same syntax:

In [40]:
print("Value before:", x2[0,0])
x2[0,0] = 4.1 # Note: the value is truncated to an integer to match the array's data type
print("Value after:", x2[0,0])

Value before: 3
Value after: 4


Array slicing is a tool to access subarrays. Remember that array indices begin at 0. The ```:``` character is the slice operator, and the upper bound is not included in the slice: for example, ```1:3``` returns the elements at indices 1 and 2 (the second and third elements).

In [44]:
print("Option 1:\n", x2[1:3,]) # Rows at indices 1 and 2 (the second and third rows), all columns
start = 1
end = 3
print("Option 2:\n", x2[start:end,]) # Integer variables can also be defined and used as slice bounds

Option 1:
 [[7 6 8 8]
 [1 6 7 7]]
Option 2:
 [[7 6 8 8]
 [1 6 7 7]]


More complex indexing is also possible using the ```::i``` syntax, which tells NumPy to return every i-th row or column; a negative step traverses the axis in reverse.

In [52]:
print("Forward:\n", x2[:,::2]) # Gives me every row and every second column starting with column 0
print("Backward:\n", x2[:,::-2]) # Gives me every row and every second column, starting with the last column

Forward:
 [[4 2]
 [7 8]
 [1 7]]
Backward:
 [[4 5]
 [8 6]
 [7 6]]
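
One consequence of slicing worth illustrating is that a basic slice is a view onto the original array's memory rather than a copy. The sketch below uses a hand-written array with the same values as `x2` before the update above; `sub_view` and `sub_copy` are illustrative names.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

# A basic slice is a view: it shares memory with x2
sub_view = x2[:2, :2]
sub_view[0, 0] = 99  # this also changes x2[0, 0]

# .copy() produces an independent array
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = -1  # x2 is unaffected by this assignment

print(x2[0, 0])  # 99
```

Calling `.copy()` is the usual way to keep the original data safe when a subarray will be modified.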


## Reshaping arrays

In [57]:
grid = np.arange(1, 10).reshape((3, 3)) # numbers from 1 to 9 in a 3x3 grid
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


The reshaped array size must match the original array size.

In [59]:
grid.reshape((3,1))

ValueError: cannot reshape array of size 9 into shape (3,1)

In [61]:
grid.reshape((9,1))

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
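
When only one dimension is known in advance, `reshape` accepts `-1` for a single dimension and infers its length from the array size. A short sketch, reusing the `grid` array defined above:

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))

column = grid.reshape((-1, 1))  # -1 is inferred as 9
row = grid.reshape((1, -1))     # -1 is inferred as 9
flat = grid.reshape(-1)         # one-dimensional array of all 9 values

print(column.shape)  # (9, 1)
print(row.shape)     # (1, 9)
print(flat.shape)    # (9,)
```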

A slight nuisance when working with NumPy is its treatment of 1D arrays, which have neither a row nor a column orientation. It is often necessary to add a second dimension using ```np.newaxis``` before performing operations that expect a 2D array.

In [63]:
x = np.array([1,2,3])
print("Shape before:", x.shape)
y = x[np.newaxis,:]
print("Shape after:", y.shape)

Shape before: (3,)
Shape after: (1, 3)
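
The same idea gives a column orientation when the new axis is placed second. As a sketch of why this matters, adding a column-shaped and a row-shaped version of the same vector relies on NumPy's broadcasting rules to produce a full table of pairwise sums; the names `row`, `column`, and `table` are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x[np.newaxis, :]     # shape (1, 3)
column = x[:, np.newaxis]  # shape (3, 1)

# Broadcasting combines (3, 1) and (1, 3) into a (3, 3) result
table = column + row
print(table)
# [[2 3 4]
#  [3 4 5]
#  [4 5 6]]
```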


In many applications, we have several arrays that we want to combine (concatenate) to form a single array.

In [65]:
x = np.array([1,2,3])
y = np.array([3,2,1])
np.concatenate([x,y])

array([1, 2, 3, 3, 2, 1])

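We can also concatenate two-dimensional arrays, but we need to be aware of the axis along which the arrays are joined. By default, ```np.concatenate``` joins along the first axis (```axis=0```), stacking the rows of the second array below those of the first; passing ```axis=1``` joins the arrays side by side instead. (NumPy also stores array data in row-major, C-style order by default.)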

In [68]:
arr1 = np.arange(1, 10).reshape((3, 3))
arr2 = np.arange(21, 30).reshape((3, 3))
print("Default concatenation (axis=0)\n",np.concatenate([arr1,arr2]))
print("Concatenation along axis=1\n",np.concatenate([arr1,arr2],axis=1)) # Axes are zero-based

Default concatenation (axis=0)
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [21 22 23]
 [24 25 26]
 [27 28 29]]
Concatenation along axis=1
 [[ 1  2  3 21 22 23]
 [ 4  5  6 24 25 26]
 [ 7  8  9 27 28 29]]


When dealing with 2D arrays, it is often clearer to work with the ```np.vstack()``` and ```np.hstack()``` functions. These functions also work for higher-dimensional arrays, but they do not necessarily improve clarity in such instances.

In [69]:
print("np.vstack()\n",np.vstack([arr1,arr2]))
print("np.hstack()\n",np.hstack([arr1,arr2]))

np.vstack()
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [21 22 23]
 [24 25 26]
 [27 28 29]]
np.hstack()
 [[ 1  2  3 21 22 23]
 [ 4  5  6 24 25 26]
 [ 7  8  9 27 28 29]]
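
These functions are also convenient when the inputs have different numbers of dimensions. The sketch below, using small made-up arrays, stacks a 1D array on top of a 2D array and appends a column to a 2D array.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vstack treats the 1D array as a single row
stacked_rows = np.vstack([x, grid])

# hstack needs matching row counts, so the extra column has shape (2, 1)
extra_col = np.array([[99],
                      [99]])
stacked_cols = np.hstack([grid, extra_col])

print(stacked_rows.shape)  # (3, 3)
print(stacked_cols.shape)  # (2, 4)
```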


## Computation with NumPy

Python can be surprisingly slow, particularly when repeatedly performing small operations. Consider a simple function that computes the reciprocal of each element in an array of integers.

In [70]:
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [71]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array) # Use IPython %timeit to monitor runtime

2 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The loop is slow because Python must check the data type and look up the appropriate operation each time it performs the reciprocal calculation. NumPy offers a *vectorized* implementation that is much faster: the operation is applied to the full array, and the loop over individual elements is pushed into NumPy's compiled layer.

In [72]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit (1.0/big_array) # Use IPython %timeit to monitor runtime

3 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
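
Outside IPython, where `%timeit` is unavailable, a rough comparison can be made with the standard library. The following is a sketch rather than a rigorous benchmark: it uses a smaller array than above, times each approach once with `time.perf_counter`, and checks that both give the same result. Absolute timings will differ between machines.

```python
import time
import numpy as np

def compute_reciprocals(values):
    """Element-by-element loop, as in the example above."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

np.random.seed(0)
values = np.random.randint(1, 100, size=100_000)

# Time the explicit Python loop
start = time.perf_counter()
looped = compute_reciprocals(values)
loop_seconds = time.perf_counter() - start

# Time the vectorized expression
start = time.perf_counter()
vectorized = 1.0 / values
vectorized_seconds = time.perf_counter() - start

print("Results match:", np.allclose(looped, vectorized))
print(f"Loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```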


NumPy functions are intuitive to use because they use Python's built-in arithmetic operators:

In [80]:
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
print("-x= ", -x)
print("x ** 2 = ", x ** 2) # exponentiation
print("x % 2  = ", x % 2) # modulus
print("e^x =", np.exp(x))
print("3^x =", np.power(3, x))
print("ln(x) =", np.log(x))
print("log10(x) =", np.log10(x))

x = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]
-x=  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]
e^x = [ 1.          2.71828183  7.3890561  20.08553692]
3^x = [ 1  3  9 27]
ln(x) = [      -inf 0.         0.69314718 1.09861229]
log10(x) = [      -inf 0.         0.30103    0.47712125]
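
Each arithmetic operator on an array corresponds to a named NumPy function, which the sketch below checks for a few cases. It also shows one possible way, among several, to avoid the `-inf` entries: taking the logarithm of the positive elements only.

```python
import numpy as np

x = np.arange(4)

# Operators on arrays give the same results as the corresponding NumPy functions
same_add = np.array_equal(x + 5, np.add(x, 5))
same_mul = np.array_equal(x * 2, np.multiply(x, 2))
same_floordiv = np.array_equal(x // 2, np.floor_divide(x, 2))
same_pow = np.array_equal(x ** 2, np.power(x, 2))
print(same_add, same_mul, same_floordiv, same_pow)  # True True True True

# Taking the logarithm of only the positive entries avoids log(0)
positive = x[x > 0]
logs = np.log(positive)
print(logs)
```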


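Note that ```np.log(0)``` evaluates to ```-inf```, and NumPy emits a ```RuntimeWarning``` (divide by zero) when computing it; this is why the first entry of each logarithm output above is ```-inf```.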


## Matrix operations

In [138]:
w = np.random.randint(0, 10, (3, 3)) # 3x3 matrix
x = np.random.randint(0, 10, (3, 3)) # 3x3 matrix
y = np.random.randint(0, 10, (3)) # length-3 vector (1D array of shape (3,))
z = np.random.randint(0, 10, (3)) # length-3 vector (1D array of shape (3,))

In [140]:
print("Matrix product of w * x using @\n", w@x)
print("Matrix product of w * y using @\n", w@y)
print("Elementwise product of x * z using *\n", x*z)
print("Inner product of y * z using np.inner() \n", np.inner(y,z))
print("Outer product of y * z using np.outer() \n", np.outer(y,z))

Matrix product of w * x using @
 [[29 64 48]
 [50 63 52]
 [42 81 40]]
Matrix product of w * y using @
 [52 50 71]
Elementwise product of x * z using *
 [[28 35  0]
 [ 0 35  4]
 [21 21  8]]
Inner product of y * z using np.inner() 
 64
Outer product of y * z using np.outer() 
 [[35 35  5]
 [28 28  4]
 [ 7  7  1]]
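
Because the random values above make the results hard to check by hand, here is a small sketch with fixed, made-up inputs. It contrasts the matrix product `@` with the element-wise `*`, and shows how a 1D vector of shape `(2,)` differs from an explicit column of shape `(2, 1)`.

```python
import numpy as np

w = np.array([[1, 2],
              [3, 4]])
y = np.array([1, 10])     # 1D vector, shape (2,)
y_col = y[:, np.newaxis]  # explicit column, shape (2, 1)

mat_vec = w @ y      # matrix-vector product, shape (2,)
mat_col = w @ y_col  # matrix-column product, shape (2, 1)
elementwise = w * y  # y is multiplied into each row of w, element by element

print(mat_vec)      # [21 43]
print(mat_col)
# [[21]
#  [43]]
print(elementwise)
# [[ 1 20]
#  [ 3 40]]
print(np.array_equal(mat_vec, np.dot(w, y)))  # True
```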


## Aggregations: Min, Max, Etc.

In [146]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

68.5 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
537 µs ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [147]:
np.min(big_array), np.max(big_array)

(6.036102300210899e-07, 0.9999969605678884)

In [148]:
M = np.random.random((3, 4))
print(M)

[[0.90417929 0.90436453 0.19680251 0.42850848]
 [0.77015982 0.78572466 0.93921145 0.76088489]
 [0.24667418 0.97096218 0.72598625 0.56952897]]


In [149]:
M.sum() # Sum across all dimensions

8.202987209152711

In [150]:
M.min(axis=0) # Minimum for each column

array([0.24667418, 0.78572466, 0.19680251, 0.42850848])

In [None]:
M.max(axis=1) # Maximum for each row

Many other aggregation functions are available: percentile, median, mean, standard deviation, ...
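
A few of these aggregations are sketched below on a small array with known values, so the results can be verified by hand. The array here is made up for illustration and is not the random `M` from above.

```python
import numpy as np

M = np.arange(12).reshape((3, 4))
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

col_sums = M.sum(axis=0)  # one value per column
row_maxes = M.max(axis=1) # one value per row

print(col_sums)               # [12 15 18 21]
print(row_maxes)              # [ 3  7 11]
print(M.mean())               # 5.5
print(np.median(M))           # 5.5
print(np.percentile(M, 25))   # 2.75
print(M.std())                # standard deviation of all 12 values
```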

## Comparison operators
Comparison operators are applied element-wise and return a boolean array. This is useful when we want to filter an array, because the result can be used as a *boolean mask*.

In [157]:
x = np.array([1, 2, 3, 4, 5])
x < 3 # Is x less than 3?

array([ True,  True, False, False, False])

In [158]:
x[x<3] # Filter elements of x that are less than 3

array([1, 2])
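
Masks can also be combined and counted. The sketch below uses the element-wise operators `&` (and) and `|` (or), which need parentheses around each comparison, together with `np.sum`, `np.any`, and `np.all`.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Combine conditions element-wise; parentheses are required
inside = x[(x > 1) & (x < 5)]
outside = x[(x <= 1) | (x >= 5)]

# True counts as 1, so summing a mask counts the matching elements
n_small = np.sum(x < 3)

print(inside)         # [2 3 4]
print(outside)        # [1 5]
print(n_small)        # 2
print(np.any(x > 4))  # True
print(np.all(x > 0))  # True
```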

## Conclusions
NumPy is a powerful tool for storing and computing on numerical data. It offers many features beyond those covered here, and they are well documented online.


## References
https://jakevdp.github.io/PythonDataScienceHandbook/