# NumPy for pandas

pandas is built atop NumPy and consequently one should learn about NumPy in the process of learning pandas.

## Installing and Importing NumPy

NumPy will have already been installed if pandas has been installed, so there's no need to worry about that.

In [76]:
# Import top-level NumPy functions using the np. prefix.
import numpy as np

## Benefits and Characteristics of NumPy Arrays

NumPy arrays are the bee's knees when compared to ordinary Python lists. The benefits are primarily performance- and manipulation-oriented. Some of them are outlined here:

* Contiguous allocation in memory
* Vectorized operations
    * A technique for applying an operation across all or a subset of elements in an array. It's a hell of a lot faster than using `for` loops.
* Boolean selection
    * Select elements from an array according to logical criteria.
* Sliceability
    * The ability to select elements in an array with a nice and tidy notation.
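As a rough illustration of the vectorization point above, the sketch below doubles a handful of numbers twice: once with a plain Python loop and once with a single NumPy expression. The names `numbers`, `doubled_loop`, and `doubled_vec` are made up for this example.

```python
import numpy as np

numbers = [0, 1, 2, 3, 4]

# Loop version: Python visits each element one at a time.
doubled_loop = [n * 2 for n in numbers]

# Vectorized version: one expression applied to the whole array at once.
doubled_vec = np.array(numbers) * 2

# Both approaches produce the same values.
print(doubled_loop == doubled_vec.tolist())
```

The speed difference only becomes noticeable on large arrays, so this tiny example is about the syntax rather than the timing.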

## Creating NumPy Arrays and Performing Basic Array Operations

There's more than one way to create a NumPy array.

In [77]:
# Create a basic array.
a1 = np.array([1, 2, 3, 4, 5])
a1

array([1, 2, 3, 4, 5])

In [78]:
# Show the array's type.
type(a1)

numpy.ndarray

In [79]:
# How many elements are there?
np.size(a1)

5

NumPy refers to its n-dimensional array type as `ndarray`, and the one above comprises five elements, as reported by the `np.size()` function. Every element of an array shares a single data type, so attempting to create an array from more than one data type prompts coercion to a common type.

In [80]:
# Mixed integers and floats results in float coercion.
a2 = np.array([1, 2, 3, 4.0, 5.0])
a2

array([ 1.,  2.,  3.,  4.,  5.])

In [81]:
# Examine the type of the array elements.
a2.dtype

dtype('float64')

There are more ways to create arrays.

In [82]:
a3 = np.array([0] * 10)
a3

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [83]:
# Create an array using the range() function.
np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

An efficient means to create an array of zeros (sometimes useful) is to use the `np.zeros()` function. The values default to floats.

In [84]:
# Create a numpy array of ten zeroes.
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

The `np.arange()` function is more efficient than passing a `range()` object to `np.array()`, and it takes the same start, stop, and step arguments.

In [85]:
# Non-inclusive end point, starting from zero.
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [86]:
# Explicit start (inclusive) and end (non-inclusive).
np.arange(6, 14)

array([ 6,  7,  8,  9, 10, 11, 12, 13])

In [87]:
# Start, stop, and step.
np.arange(2, 21, 2)

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [88]:
# Start, stop, and step going in reverse.
np.arange(10, -1, -1)

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0])

The `np.linspace()` function is broadly similar to `np.arange()`, but generates an array from start and stop values plus the number of evenly spaced values to produce. The stop value is included by default, and values default to floats.

In [89]:
np.linspace(0, 21, 6)

array([  0. ,   4.2,   8.4,  12.6,  16.8,  21. ])
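Two optional `np.linspace()` parameters seem worth knowing about, sketched below: `retstep=True` also returns the spacing between values, and `endpoint=False` leaves the stop value out. The variable names are just for illustration.

```python
import numpy as np

# retstep=True returns the array plus the spacing between consecutive values.
points, step = np.linspace(0, 21, 6, retstep=True)
print(step)

# endpoint=False excludes the stop value, much like np.arange() does.
open_points = np.linspace(0, 20, 5, endpoint=False)
print(open_points)
```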

The n-dimensional arrays that I'll be working with will predominantly be one- or two-dimensional. Making a 2D NumPy array is as easy as passing a list of lists.

In [90]:
# Create a 2x2 two-dimensional array.
np.array(
    [[1, 2],
     [3, 4]]
)

array([[1, 2],
       [3, 4]])

A more convenient approach than this mess of parentheses and brackets is the `.reshape()` method, which can be applied to a one-dimensional array.

In [91]:
# Create a 1x20 array and reshape it into a 5x4 2D array.
m = np.arange(0, 20).reshape(5, 4)
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [92]:
# Examine the size (product of the rows and columns).
np.size(m)

20

In [93]:
# Examine the shape.
np.shape(m)

(5, 4)

The number of rows or columns can be returned individually by using the `np.size()` function and specifying the axis, which is 0 for rows and 1 for columns.

In [94]:
# Return the number of columns.
np.size(m, 1)

4
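A small sketch of the same idea using the `.shape` tuple, assuming the 5x4 array from above. For a 2D array the tuple is ordered (rows, columns), which lines up with axis 0 and axis 1.

```python
import numpy as np

m = np.arange(20).reshape(5, 4)

# .shape is (rows, columns) for a 2D array, so it can be unpacked directly.
n_rows, n_cols = m.shape

# np.size() with an axis gives the same numbers: axis 0 -> rows, axis 1 -> columns.
print(n_rows == np.size(m, 0), n_cols == np.size(m, 1))
```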

### Selecting Array Elements

NumPy arrays can have their elements selected in a zero-based fashion via the `[]` operator.

In [95]:
# Select elements at positions 0 and 2.
a1[0], a2[2]

(1, 3.0)

In [96]:
# Select an element in row 1, column 2.
m[1, 2]

6

Entire rows can be selected (colon optional).

In [97]:
# Select all items in row 1.
m[1,:]

array([4, 5, 6, 7])

Entire columns can be selected as well (colon not optional).

In [98]:
# Select all items in column 3.
m[:,3]

array([ 3,  7, 11, 15, 19])
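The same `[]` notation accepts `start:stop:step` ranges on each axis, which is where the tidy slicing syntax really pays off. A minimal sketch, rebuilding the 5x4 array from earlier (the result names are made up):

```python
import numpy as np

m = np.arange(20).reshape(5, 4)

# Rows 1 and 2 (the stop index 3 is excluded), all columns.
middle_rows = m[1:3, :]

# Every row, but only columns 0 and 2 (start at 0, step by 2).
even_cols = m[:, ::2]

# The 2x2 block in the top-left corner.
corner = m[:2, :2]

print(middle_rows)
print(even_cols)
print(corner)
```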

### Logical Operations on Arrays

Array values can be tested against logical criteria.

In [99]:
# Test which items are less than 2.
a = np.arange(5)
a < 2

array([ True,  True, False, False, False], dtype=bool)

More complicated, yet nevertheless Pythonic, expressions will cause problems.

In [100]:
# This will throw an error.
# a < 2 or a > 3

We need to use `|` and parentheses rather than `or` to get the desired result. The parentheses matter because `|` binds more tightly than the comparison operators.

In [101]:
(a < 2) | (a > 3)

array([ True,  True, False, False,  True], dtype=bool)
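The other element-wise logical operators follow the same pattern. The sketch below shows `&` for "and", `~` for "not", and `np.logical_or()` as a function-call alternative to `|`; the variable names are only for illustration.

```python
import numpy as np

a = np.arange(5)

# Element-wise "and": values greater than 0 and less than 4.
both = (a > 0) & (a < 4)

# Element-wise "not": flip every True/False in the mask.
not_small = ~(a < 2)

# Function form of |, which needs no extra parentheses.
either = np.logical_or(a < 2, a > 3)

print(both)
print(not_small)
print(either)
```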

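NumPy has the `np.vectorize()` function, which takes a function as an argument and returns a new function that applies it to each element of an array. It is a convenience wrapper rather than a performance boost, since it still loops over the elements under the hood. The syntax looks a bit weird in my opinion.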

In [102]:
def logical(x):
    return x < 2 or x > 3

np.vectorize(logical)(a)

array([ True,  True, False, False,  True], dtype=bool)
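As a sanity check, the sketch below compares the `np.vectorize()` route with the plain operator route on the same array. It reuses the `logical()` function from above; `via_vectorize` and `via_operators` are made-up names.

```python
import numpy as np

def logical(x):
    # Works on one scalar at a time, so Python's "or" is fine here.
    return x < 2 or x > 3

a = np.arange(5)

# Wrap the scalar function so that it can be called on a whole array.
via_vectorize = np.vectorize(logical)(a)

# The same test written with element-wise operators.
via_operators = (a < 2) | (a > 3)

print(np.array_equal(via_vectorize, via_operators))
```

When the test can be written with operators, that form is usually the one to prefer; `np.vectorize()` seems most useful for scalar functions that can't easily be rewritten.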

Boolean arrays can be used as selection criteria for other arrays.

In [103]:
# Test if values are less than 3.
r = a < 3

# Use the boolean array to select values from another array.
a[r]

array([0, 1, 2])

The number of `True` values can be counted using the `np.sum()` function. So useful!

In [104]:
# Count how many values are less than 3.
# Achievable, because True is equivalent to 1.
np.sum(a < 3)

3
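Two closely related tricks, sketched under the same setup: `np.count_nonzero()` counts the `True` values directly, and taking the mean of a boolean array gives the fraction of elements that pass the test.

```python
import numpy as np

a = np.arange(5)
mask = a < 3

# Count the True values without relying on True == 1.
count = np.count_nonzero(mask)

# The mean of a boolean array is the proportion of True values.
fraction = mask.mean()

print(count, fraction)
```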

Arrays can be compared against other arrays.

In [105]:
a1 = np.arange(0, 5)
a2 = np.arange(5, 0, -1)
a1 < a2

array([ True,  True,  True, False, False], dtype=bool)

Multidimensional logical arrays can be created thusly.

In [106]:
# Create two 3x3 multidimensional arrays.
a1 = np.arange(9).reshape(3, 3)
a2 = np.arange(9, 0, -1).reshape(3, 3)
a1 < a2

array([[ True,  True,  True],
       [ True,  True, False],
       [False, False, False]], dtype=bool)

### Reshaping Arrays

NumPy makes it trivial to reshape one-dimensional arrays into matrices and back again.

In [107]:
# Create a 1x9 array.
a = np.arange(0, 9)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [108]:
# Reshape the array as a 3x3 matrix.
m = a.reshape(3, 3)
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [109]:
# And back again.
reshaped = m.reshape(9)
reshaped

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

The `.ravel()` method is another way to flatten a matrix. Probably the better way to do it, actually.

In [110]:
raveled = m.ravel()
raveled

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

<h1 align='center'>⚠ WARNING ⚠</h1>

The `.reshape()` and `.ravel()` methods return a *view* of the original array whenever possible, not an entirely new array. This means that changes made to a reshaped or raveled array will usually affect the original array!

The `.flatten()` method, on the other hand, will return an entirely new array, so changes made there will not affect the original array.

In [111]:
m1 = np.arange(9).reshape(3, 3)
m2 = np.arange(9).reshape(3, 3)
print(m1)
print(m2)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[0 1 2]
 [3 4 5]
 [6 7 8]]


In [112]:
raveled = m1.ravel()
raveled[0] = 1000

flattened = m2.flatten()
flattened[0] = 1000

print(raveled)
print(flattened)

[1000    1    2    3    4    5    6    7    8]
[1000    1    2    3    4    5    6    7    8]


In [113]:
print(m1)
print(m2)

[[1000    1    2]
 [   3    4    5]
 [   6    7    8]]
[[0 1 2]
 [3 4 5]
 [6 7 8]]
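When in doubt about whether two arrays share data, `np.shares_memory()` can answer the question without modifying anything. The sketch below also includes a case where `.ravel()` has to copy: a transposed array is not laid out row by row in memory, so no view is possible.

```python
import numpy as np

m = np.arange(9).reshape(3, 3)

# ravel() on a freshly reshaped array returns a view of the same data.
ravel_shares = np.shares_memory(m, m.ravel())

# flatten() always copies.
flatten_shares = np.shares_memory(m, m.flatten())

# ravel() on the transpose cannot be a view, so it falls back to a copy.
transposed_ravel_shares = np.shares_memory(m, m.T.ravel())

print(ravel_shares, flatten_shares, transposed_ravel_shares)
```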


The `.shape` property returns a tuple with the dimensions of an array.

In [114]:
# Show the dimensions of an array or matrix.
flattened.shape

(9,)

A tuple can be assigned to this property, for a reshape in place. Probably better to stick to `.reshape()` in my opinion.

In [115]:
# De-flatten the array.
flattened.shape = (3, 3)
flattened

array([[1000,    1,    2],
       [   3,    4,    5],
       [   6,    7,    8]])

The `.resize()` method functions similarly to the `.reshape()` method, except that `.resize()` modifies an array **in-place** while `.reshape()` returns a **new** array object (usually a view of the same data) and leaves the original's shape untouched. I wonder how long it will take me to remember this distinction.

In [116]:
# Create an array and then resize it in-place.
m = np.arange(0, 25)
m.resize(5, 5)
m

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
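A side-by-side sketch that might help the distinction stick. It uses two separate six-element arrays, `a` and `b`, invented for this example.

```python
import numpy as np

# reshape(): returns a new array object; a itself keeps its original shape.
a = np.arange(6)
reshaped = a.reshape(2, 3)
print(a.shape, reshaped.shape)

# resize(): changes b itself and returns None.
b = np.arange(6)
result = b.resize(2, 3)
print(result, b.shape)
```

Because `.resize()` returns `None`, writing `b = b.resize(2, 3)` would throw the array away, which is an easy mistake to make.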

### Combining Arrays

Stacking is the NumPy nomenclature for combining arrays, and there are a few ways to go about it. Stacking can be accomplished horizontally, vertically, and depth-wise (along a third axis).

In [117]:
# Create two arrays for demonstration purposes.
a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10

In [118]:
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [119]:
b

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

Horizontal stacking combines arrays in a left-to-right manner. A tuple containing the arrays to be stacked, in order, must be passed to the `np.hstack()` function. It is functionally equivalent to using `np.concatenate((a, b), axis=1)`, although I suspect that `np.hstack()` is more common.

In [120]:
np.hstack((a, b))

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

Vertical stacking with `np.vstack()` combines arrays top-to-bottom. It is functionally equivalent to using `np.concatenate((a, b), axis=0)`, although I suspect that `np.vstack()` is more common.

In [121]:
np.vstack((a, b))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])
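The equivalence with `np.concatenate()` can be checked directly. Note that the arrays go in as a single tuple, just as with the stacking functions. This sketch rebuilds the same `a` and `b` so that it runs on its own.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10

# axis=1 joins left-to-right, like np.hstack().
h = np.concatenate((a, b), axis=1)

# axis=0 joins top-to-bottom, like np.vstack().
v = np.concatenate((a, b), axis=0)

print(np.array_equal(h, np.hstack((a, b))))
print(np.array_equal(v, np.vstack((a, b))))
```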

Depth stacking with `np.dstack()` joins the arrays along a new third axis, depth. For two 3x3 inputs the result has shape `(3, 3, 2)`, and the entry at row `i`, column `j` is the pair `[a[i, j], b[i, j]]`.

In [122]:
np.dstack((a, b))

array([[[ 0, 10],
        [ 1, 20],
        [ 2, 30]],

       [[ 3, 40],
        [ 4, 50],
        [ 5, 60]],

       [[ 6, 70],
        [ 7, 80],
        [ 8, 90]]])
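Poking at the shape and a few indices makes the depth axis easier to picture. A small sketch, rebuilding `a` and `b` from above; `d` is a made-up name for the stacked result.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10
d = np.dstack((a, b))

# Two 3x3 inputs become one 3x3x2 array.
print(d.shape)

# Each (row, column) position holds one value from a and one from b.
print(d[0, 1])

# Slicing along the last axis recovers the original arrays.
print(np.array_equal(d[:, :, 0], a), np.array_equal(d[:, :, 1], b))
```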

Column stacking performs a horizontal stack of 1D arrays, making each array a column in the resulting 2D array.

In [123]:
one_d_a = np.arange(1, 10)
one_d_b = one_d_a * 10

In [124]:
one_d_a

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [125]:
one_d_b

array([10, 20, 30, 40, 50, 60, 70, 80, 90])

In [126]:
np.column_stack((one_d_a, one_d_b))

array([[ 1, 10],
       [ 2, 20],
       [ 3, 30],
       [ 4, 40],
       [ 5, 50],
       [ 6, 60],
       [ 7, 70],
       [ 8, 80],
       [ 9, 90]])

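Row stacking works in the as-expected manner as a complement to column stacking. Note that `np.row_stack()` is an alias for `np.vstack()` and is deprecated in recent NumPy releases, so `np.vstack()` is the safer spelling in new code.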

In [127]:
np.row_stack((one_d_a, one_d_b))

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 20, 30, 40, 50, 60, 70, 80, 90]])
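A sketch of the same row stack written with `np.vstack()`, plus a check that the column-stacked result is simply its transpose. The two 1D arrays are rebuilt here so that the snippet runs on its own.

```python
import numpy as np

one_d_a = np.arange(1, 10)
one_d_b = one_d_a * 10

# Each 1D array becomes one row of the result.
rows = np.vstack((one_d_a, one_d_b))

# Each 1D array becomes one column of the result.
cols = np.column_stack((one_d_a, one_d_b))

print(rows.shape, cols.shape)
print(np.array_equal(rows.T, cols))
```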

### Splitting Arrays

Multidimensional arrays can, of course, be split along horizontal, vertical, and depth axes using `np.hsplit()`, `np.vsplit()`, and `np.dsplit()` functions. For the sake of brevity, only `np.hsplit()` will be explored here, as the other functions work similarly.

`np.hsplit()` takes the array to be split as a parameter, and either a scalar value to specify the number of (evenly-sized) arrays to be returned, or a list of column indexes to split the array upon.

In [128]:
# Create a 3x4 array to work with.
a = np.arange(12).reshape(3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [129]:
# Split the array into four even-sized columns.
np.hsplit(a, 4)

[array([[0],
        [4],
        [8]]), array([[1],
        [5],
        [9]]), array([[ 2],
        [ 6],
        [10]]), array([[ 3],
        [ 7],
        [11]])]

In [130]:
# Split the array into two evenly-sized matrices.
np.hsplit(a, 2)

[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

In [131]:
# Split at columns 1 and 3.
np.hsplit(a, [1, 3])

[array([[0],
        [4],
        [8]]), array([[ 1,  2],
        [ 5,  6],
        [ 9, 10]]), array([[ 3],
        [ 7],
        [11]])]

The `np.split()` function works the same way when the `axis` parameter is set to 1.

In [132]:
# Equivalent to np.hsplit(a, 2)
np.split(a, 2, axis=1)

[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]
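For completeness, here is a brief sketch of `np.vsplit()` on the same 3x4 array. It splits along rows instead of columns and accepts the same two kinds of second argument. The unpacked names are invented for this example.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Split into three evenly-sized pieces, one row each.
top, middle, bottom = np.vsplit(a, 3)

# Split at row index 1: the first row, then everything after it.
first, rest = np.vsplit(a, [1])

print(top)
print(rest)

# np.split() with axis=0 gives the same pieces as np.vsplit().
same = all(np.array_equal(x, y)
           for x, y in zip(np.vsplit(a, 3), np.split(a, 3, axis=0)))
print(same)
```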

### Useful Numerical Methods of NumPy Arrays

NumPy brings along a boatload of functions and methods that can be applied to arrays. Many of these are mathematical in nature.

There are the usual math-related functions, and they can be applied to one- and two-dimensional arrays. In the case of two-dimensional arrays, however, an axis may need to be specified.

In [133]:
# A demonstration of what NumPy arrays can do.
m = np.arange(10, 19).reshape(3, 3)
print(m)
print("Min of the entire matrix: {}".format(m.min()))
print("Max of the entire matrix: {}".format(m.max()))
print("Flat index of the min value: {}".format(m.argmin()))
print("Flat index of the max value: {}".format(m.argmax()))
print("Min value of each column: {}".format(m.min(axis=0)))
print("Min value of each row:    {}".format(m.min(axis=1)))
print("Max value of each column: {}".format(m.max(axis=0)))
print("Max value of each row:    {}".format(m.max(axis=1)))

[[10 11 12]
 [13 14 15]
 [16 17 18]]
Min of the entire matrix: 10
Max of the entire matrix: 18
Flat index of the min value: 0
Flat index of the max value: 8
Min value of each column: [10 11 12]
Min value of each row:    [10 13 16]
Max value of each column: [16 17 18]
Max value of each row:    [12 15 18]
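Without an axis, `.argmin()` and `.argmax()` report a position in the flattened array rather than a (row, column) pair. The sketch below converts that position back into 2D coordinates with `np.unravel_index()`, using the same 3x3 matrix.

```python
import numpy as np

m = np.arange(10, 19).reshape(3, 3)

# Position of the largest value, counted along the flattened array.
flat = m.argmax()

# Translate the flat position into (row, column) coordinates.
row, col = np.unravel_index(flat, m.shape)
print(flat, row, col, m[row, col])

# With an axis, argmax() returns one position per column (or per row).
print(m.argmax(axis=0))
```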


There are some handy stats functions, too.

In [134]:
a = np.arange(1, 11)
a

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [135]:
a.mean()

5.5

In [136]:
a.std()

2.8722813232690143

In [137]:
a.var()

8.25
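One detail that matters when moving on to pandas: NumPy's `.std()` and `.var()` default to the population formulas (`ddof=0`), whereas the pandas equivalents default to the sample formulas (`ddof=1`). A minimal sketch of asking NumPy for the sample versions:

```python
import numpy as np

a = np.arange(1, 11)

# Default: divide by n (population statistics).
population_var = a.var()
population_std = a.std()

# ddof=1: divide by n - 1 (sample statistics).
sample_var = a.var(ddof=1)
sample_std = a.std(ddof=1)

print(population_var, sample_var)
print(population_std, sample_std)
```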

In [138]:
a.sum()

55

In [139]:
a.prod()

3628800

In [141]:
a.cumsum()

array([ 1,  3,  6, 10, 15, 21, 28, 36, 45, 55], dtype=int32)

In [142]:
a.cumprod()

array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800], dtype=int32)

The `.all()` method returns `True` if all elements in the array are true, and `.any()` returns `True` if any element of the array is true. Like a whole bunch of `and` and `or`, respectively.

In [143]:
# Using .any() on a logical statement.
(a < 5).any()

True

In [144]:
(a < 5).all()

False
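Both methods also accept an `axis` argument, which turns them into per-column or per-row checks on a 2D array. A small sketch using a 3x3 matrix; `mask` is a made-up name.

```python
import numpy as np

m = np.arange(10, 19).reshape(3, 3)
mask = m > 14

# Does each column contain at least one value above 14?
any_per_column = mask.any(axis=0)

# Is every value in each row above 14?
all_per_row = mask.all(axis=1)

print(any_per_column)
print(all_per_row)
```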

The `.size` property returns how many elements are in an array.

In [145]:
m.size

9

`.ndim` returns the overall dimensionality of an array.

In [147]:
a = np.arange(12)
a.ndim

1

In [149]:
a.resize(3, 4)
a.ndim

2

In [152]:
a.resize(2, 2, 2)
a.ndim

3