# NumPy for pandas
Numerical Python (NumPy) is an open source Python library for scientific computing.
NumPy provides a host of features that allow a Python programmer to work with
high-performance arrays and matrices. NumPy arrays are stored more efficiently than
Python lists and allow mathematical operations to be vectorized, which results in
significantly higher performance than with looping constructs in Python.
pandas builds upon functionality provided by NumPy. The pandas library relies
heavily on the NumPy array for the implementation of the pandas Series and
DataFrame objects, and shares many of its features such as being able to slice
elements and perform vectorized operations. It is therefore useful to spend some
time going over NumPy arrays before diving into pandas.

In [1]:
# this allows us to access numpy using the
# np. prefix
import numpy as np

# Benefits and characteristics of NumPy arrays
- Contiguous allocation in memory
- Vectorized operations
- Boolean selection
- Sliceability

In [2]:
# a function that squares all the values
# in a sequence
def squares(values):
    result = []
    for v in values:
        result.append(v * v)
    return result


In [3]:
# create 100,000 numbers using python range
to_square = range(100000)
# time how long it takes to repeatedly square them all
%timeit squares(to_square)

19 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
# now let's do this with a numpy array
array_to_square = np.arange(0, 100000)
# and time using a vectorized operation
%timeit array_to_square ** 2

120 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Vectorizing the operation made our code simpler and, in this run, roughly
158 times faster (19 ms versus 120 µs)!
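
Outside of IPython, where `%timeit` is not available, the same comparison can be sketched with the standard library `timeit` module. This is only an illustration: the array size and the `number=20` repeat count are arbitrary choices made here, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np


def squares(values):
    """Square every value with an explicit Python loop."""
    return [v * v for v in values]


to_square = range(10_000)
array_to_square = np.arange(10_000)

# both approaches must produce the same numbers
assert squares(to_square) == (array_to_square ** 2).tolist()

# time 20 repetitions of each approach
loop_time = timeit.timeit(lambda: squares(to_square), number=20)
vector_time = timeit.timeit(lambda: array_to_square ** 2, number=20)
print(f"loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s")
```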

In [5]:
a1 = np.array([1, 2, 3, 4, 5])
a1

array([1, 2, 3, 4, 5])

In [6]:
type(a1)

numpy.ndarray

In [7]:
np.size(a1)

5

In [8]:
a2 = np.array([1, 2, 3, 4.0, 5.0])
a2

array([1., 2., 3., 4., 5.])

In [9]:
a2.dtype

dtype('float64')

In [10]:
a3 = np.array([0]*10)
a3

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [11]:
np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [15]:
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
# 0 <= x < 10 increment by two
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

In [17]:
# 10 >= x > 0, counting down
np.arange(10, 0, -1)

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

The `np.linspace()` function is similar to `np.arange()`, but generates an array of a
specific number of items between the specified start and stop values:

In [18]:
np.linspace(0, 10, 11)

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

<b>Note </b>that the datatype of the array by default is float, and
that the start and end values are inclusive.
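
Both defaults mentioned in the note can be overridden. The following minimal sketch uses the `endpoint` and `dtype` parameters of `np.linspace()`; the variable names are chosen here purely for illustration.

```python
import numpy as np

# default behaviour: float result, stop value included
inclusive = np.linspace(0, 10, 11)

# endpoint=False leaves the stop value out
exclusive = np.linspace(0, 10, 10, endpoint=False)

# dtype=int requests integers instead of floats
as_ints = np.linspace(0, 10, 11, dtype=int)

print(inclusive.dtype)
print(exclusive)
print(as_ints)
```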

In [19]:
a1 = np.arange(0, 10)*2
a1

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [20]:
a2 = np.arange(10,20)
a1 + a2

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37])

The pandas Series
and DataFrame objects operate similarly to one- and two-dimensional arrays,
respectively.

In [21]:
np.array([[1,2], [3,4]])

array([[1, 2],
       [3, 4]])

In [22]:
m = np.arange(0,20).reshape(5,4)

In [23]:
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [24]:
m.size

20

In [25]:
np.size(m)

20

In [26]:
# can ask the size along a given axis (0 is rows)
np.size(m, 0)

5

In [28]:
# and 1 is the columns
np.size(m, 1)

4

In [29]:
# select 0-based elements 0 and 2
a1[0], a1[2]

(0, 4)

In [30]:
# select an element in 2d array at row 1 column 2
m[1, 2]


6

In [31]:
# all items in row 1
m[1,]

array([4, 5, 6, 7])

It is possible to retrieve an entire column of a two-dimensional array using the :
symbol for the row (just omitting the row value is a syntax error):

In [32]:
m[:,2]

array([ 2,  6, 10, 14, 18])
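
Selecting a column with an integer index gives back a one-dimensional array. If the two-dimensional shape should be kept, a one-element slice can be used instead. A small sketch, with variable names that are just illustrative:

```python
import numpy as np

m = np.arange(0, 20).reshape(5, 4)

col_1d = m[:, 2]    # an integer index drops the column axis
col_2d = m[:, 2:3]  # a slice keeps the column axis

print(col_1d.shape)  # (5,)
print(col_2d.shape)  # (5, 1)
```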

In [33]:
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

## Logical operations on arrays
Logical operations can be applied to arrays to test the array values against specific
criteria. The following code tests if the values of the array are less than 2:

In [44]:
a = np.arange(10,16)
a<2

array([False, False, False, False, False, False])

In [45]:
np.take(
    a,
    np.where(a<14)
)

array([[10, 11, 12, 13]])

In [47]:
# this is commented as it will cause an exception
#print (a<2 or a>3)

In [48]:
# less than 2 or greater than 3?
(a<2) | (a>3)

array([ True,  True,  True,  True,  True,  True])
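
The commented-out `a<2 or a>3` expression fails because Python's `or` needs a single truth value for each operand, and a multi-element array does not have one. The sketch below catches the resulting `ValueError` and shows `np.logical_or()` as a function form of the `|` operator; the `result` and `elementwise` names are only for illustration.

```python
import numpy as np

a = np.arange(10, 16)

try:
    # `or` asks for bool(a < 2), which is ambiguous for an array
    result = (a < 2) or (a > 3)
except ValueError as err:
    print("ValueError:", err)
    result = None

# element-wise alternative, equivalent to (a < 2) | (a > 3)
elementwise = np.logical_or(a < 2, a > 3)
print(elementwise)
```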

In [49]:
# same test as above (less than 2 or greater than 3), for a single value
def exp(x):
    return x < 2 or x > 3

In [50]:
# np.vectorize applies the function to all items in an array
np.vectorize(exp)(a)

array([ True,  True,  True,  True,  True,  True])

In [53]:
# boolean select items < 3
a = np.arange(6)
r = a<3
# applying the result of the expression to the [] operator
# selects just the array elements where there is a matching True
a[r]

array([0, 1, 2])

In [54]:
np.sum(a<3)

3
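
Summing a Boolean array counts its `True` values because `True` is treated as 1 and `False` as 0. As a small illustration, the same count can be obtained with `np.count_nonzero()`, and the mean of the mask gives the fraction of matching elements:

```python
import numpy as np

a = np.arange(6)
mask = a < 3

count = np.count_nonzero(mask)  # same count as np.sum(mask)
fraction = mask.mean()          # share of elements that are True

print(count, fraction)  # 3 0.5
```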

In [55]:
a1 = np.arange(5)
a2 = np.arange(5,0,-1)
a1 < a2

array([ True,  True,  True, False, False])

In [56]:
a1 = np.arange(9).reshape(3, 3)
a2 = np.arange(9, 0 , -1).reshape(3, 3)
a1 < a2

array([[ True,  True,  True],
       [ True,  True, False],
       [False, False, False]])

## Slicing arrays
- `start:end:step`
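
Each of the three components is optional. As a rough sketch of how the notation behaves, the slice syntax can be compared with Python's built-in `slice` objects, where an omitted component corresponds to `None`:

```python
import numpy as np

a1 = np.arange(1, 10)

# start:end with the step omitted
assert np.array_equal(a1[3:8], a1[slice(3, 8, None)])

# only a step: every second element
assert np.array_equal(a1[::2], a1[slice(None, None, 2)])

# a negative start counts from the end of the array
last_three = a1[-3:]

# with a negative step the end position is still excluded
down_to_index_1 = a1[8:0:-1]
all_reversed = a1[8::-1]

print(last_three)
print(down_to_index_1)
print(all_reversed)
```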

In [57]:
a1 = np.arange(1, 10)
a1[3:8]

array([4, 5, 6, 7, 8])

In [58]:
a1[::2]

array([1, 3, 5, 7, 9])

In [59]:
a1[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1])

In [66]:
# note that when in reverse, this does not include
# the element specified in the second component of the slice
# that is, there is no 1 printed in this
a1[9:0:-1]

array([9, 8, 7, 6, 5, 4, 3, 2])

In [67]:
# all items from position 5 onwards
a1[5:]

array([6, 7, 8, 9])

In [68]:
a[:5]

array([0, 1, 2, 3, 4])

In [69]:
m[:,1]

array([ 1,  5,  9, 13, 17])

In [70]:
m[:,1:3]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10],
       [13, 14],
       [17, 18]])

In [71]:
m[3:5,:]

array([[12, 13, 14, 15],
       [16, 17, 18, 19]])

In [72]:
m[3:5,1:3]

array([[13, 14],
       [17, 18]])

In [73]:
m[[1,3,4],:]

array([[ 4,  5,  6,  7],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
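
One difference between the two ways of selecting rows is worth keeping in mind: a slice such as `m[3:5,:]` gives a view into the original array, while selecting with a list of indexes such as `m[[1,3,4],:]` gives a copy. A minimal sketch, with illustrative variable names:

```python
import numpy as np

m = np.arange(0, 20).reshape(5, 4)

rows_slice = m[3:5, :]       # view: shares data with m
rows_list = m[[1, 3, 4], :]  # copy: independent of m

rows_list[0, 0] = -1   # does not touch m
rows_slice[0, 0] = -12 # writes through to m[3, 0]

print(m[1, 0], m[3, 0])  # 4 -12
```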

In [74]:
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [75]:
# create a 9 element 1-d array
a = np.arange(0, 9)
# and reshape to a 3x3 2-d array
m = a.reshape(3, 3)
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [81]:
reshaped = m.reshape(9)
reshaped

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [82]:
raveled = m.ravel()
raveled

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [84]:
reshaped = m.reshape(np.size(m))
raveled = m.ravel()

In [85]:
reshaped[2] = 1000
raveled[5] = 2000

In [86]:
m

array([[   0,    1, 1000],
       [   3,    4, 2000],
       [   6,    7,    8]])

The `.flatten()` method functions similarly to `.ravel()` but returns a new
array with copied data instead of a view. Changes to the result do not change the
original matrix:

In [87]:
# flattened is like ravel, but a copy of the data,
# not a view into the source
m2 = np.arange(0, 9).reshape(3,3)
flattened = m2.flatten()
# change in the flattened object
flattened[0] = 1000
flattened


array([1000,    1,    2,    3,    4,    5,    6,    7,    8])

In [88]:
m2

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
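
Whether a result is a view or a copy can also be checked directly with `np.shares_memory()` rather than by modifying values. The sketch below additionally shows that `.ravel()` falls back to a copy when a view is not possible, here for a transposed array:

```python
import numpy as np

m = np.arange(9).reshape(3, 3)

ravel_is_view = np.shares_memory(m, m.ravel())      # True
flatten_is_view = np.shares_memory(m, m.flatten())  # False

# the transpose is not laid out row by row in memory,
# so ravel has to copy the data
ravel_of_transpose_is_view = np.shares_memory(m, m.T.ravel())  # False

print(ravel_is_view, flatten_is_view, ravel_of_transpose_is_view)
```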

In [90]:
# we can reshape by assigning a tuple to the .shape property
# we start with this, which has one dimension
flattened.shape

(9,)

The `.shape` property can also be assigned a tuple, which forces the array to reshape itself
as specified:

In [91]:
# and make it 3x3
flattened.shape = (3, 3)
# it is no longer flattened
flattened

array([[1000,    1,    2],
       [   3,    4,    5],
       [   6,    7,    8]])

In linear algebra, it is common to transpose a matrix. This can be performed with the
`.transpose()` method, as shown here:

In [92]:
flattened.transpose()

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])

In [93]:
flattened.T

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])
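
Like `.ravel()`, the transpose is a view of the same data, so writing through it changes the original array. A short sketch on a non-square array, which also makes the swapped shape visible:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
t = m.T

print(m.shape, t.shape)  # (2, 3) (3, 2)

# t[0, 1] and m[1, 0] refer to the same element
t[0, 1] = 99
print(m[1, 0])  # 99
```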

The `.resize()` method functions similarly to the `.reshape()` method, except
that while `.reshape()` returns a new array object (a view of the same data whenever
possible, as the earlier example of modifying `reshaped` showed), `.resize()`
changes the shape of the array in place:

In [94]:
# we can also use .resize, which changes shape of
# an object in-place
m = np.arange(0, 9).reshape(3,3)
m.resize(1, 9)
m # my shape has changed

array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
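
The difference between the two methods can be sketched as follows. The variable names are illustrative, and the example keeps the total number of elements unchanged:

```python
import numpy as np

base = np.arange(9)

# reshape leaves base alone and returns another array object
reshaped = base.reshape(3, 3)
print(base.shape, reshaped.shape)        # (9,) (3, 3)
print(np.shares_memory(base, reshaped))  # True

# resize changes the array itself and returns None
resized = np.arange(9)
returned = resized.resize(3, 3)
print(resized.shape, returned)           # (3, 3) None
```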

# Combining arrays
Arrays can be combined in various ways. This process in NumPy is referred to
as stacking. Stacking can take various forms, including horizontal, vertical, and
depth-wise stacking. To demonstrate this, we will use the following two arrays
(a and b):

In [96]:
# creating two arrays for examples
a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [97]:
b

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [98]:
# horizontally stack the two arrays
# b becomes columns of a to the right of a's columns
np.hstack((a, b))

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

In [99]:
# identical to concatenate along axis = 1
np.concatenate((a, b), axis = 1)

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

In [100]:
# vertical stack, adding b as rows after a's rows
np.vstack((a, b))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [101]:
# concatenate along axis=0 is the same as vstack
np.concatenate((a, b), axis = 0)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

Depth stacking takes a list of arrays and arranges them in order along an additional
axis referred to as the depth:


In [102]:
np.dstack((a, b))

array([[[ 0, 10],
        [ 1, 20],
        [ 2, 30]],

       [[ 3, 40],
        [ 4, 50],
        [ 5, 60]],

       [[ 6, 70],
        [ 7, 80],
        [ 8, 90]]])
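
A quick way to keep the three stacking directions apart is to compare the shapes of their results. The sketch below also expresses depth stacking as a concatenation along a new third axis, in the same spirit as the `np.concatenate()` equivalents shown for the horizontal and vertical cases:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10

h = np.hstack((a, b))  # columns side by side
v = np.vstack((a, b))  # rows on top of each other
d = np.dstack((a, b))  # pairs along a new third axis

print(h.shape, v.shape, d.shape)  # (3, 6) (6, 3) (3, 3, 2)

# dstack is the same as adding a trailing axis and concatenating on it
d_manual = np.concatenate((a[:, :, np.newaxis], b[:, :, np.newaxis]), axis=2)
print(np.array_equal(d, d_manual))  # True
```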

In [103]:
# set up 1-d array
one_d_a = np.arange(5)
one_d_a

array([0, 1, 2, 3, 4])

In [104]:
# another 1-d array
one_d_b = (one_d_a + 1) * 10

In [105]:
one_d_b

array([10, 20, 30, 40, 50])

In [106]:
# stack the two columns
np.column_stack((one_d_a, one_d_b))

array([[ 0, 10],
       [ 1, 20],
       [ 2, 30],
       [ 3, 40],
       [ 4, 50]])

In [107]:
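# stack along rows; np.row_stack is an alias of np.vstack, and recent
# NumPy releases deprecate the alias, so np.vstack is used here
np.vstack((one_d_a, one_d_b))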

array([[ 0,  1,  2,  3,  4],
       [10, 20, 30, 40, 50]])

# Splitting arrays

Arrays can also be split into multiple arrays along the horizontal, vertical, and depth
axes using the `np.hsplit()`, `np.vsplit()`, and `np.dsplit()` functions. We will
look most closely at the `np.hsplit()` function, as the others work similarly.

The `np.hsplit()` function takes the array to split as a parameter, and either a scalar
value to specify the number of arrays to be returned, or a list of column indexes to
split the array upon.

If splitting into a number of arrays, each array returned will have the same count of
columns. The source array must have a number of columns that is a multiple of the
specified value.

In [108]:
# sample array
a = np.arange(12).reshape(3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [109]:
np.hsplit(a,4)

[array([[0],
        [4],
        [8]]), array([[1],
        [5],
        [9]]), array([[ 2],
        [ 6],
        [10]]), array([[ 3],
        [ 7],
        [11]])]

In [110]:
np.hsplit(a, 2)

[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]
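
If the number of columns is not a multiple of the requested count, `np.hsplit()` raises a `ValueError`. When uneven pieces are acceptable, `np.array_split()` can be used instead, as in this small sketch:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

try:
    # 4 columns cannot be divided evenly into 3 arrays
    np.hsplit(a, 3)
    split_failed = False
except ValueError as err:
    print("ValueError:", err)
    split_failed = True

# array_split allows pieces of unequal width
parts = np.array_split(a, 3, axis=1)
print([p.shape for p in parts])  # [(3, 2), (3, 1), (3, 1)]
```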

In [112]:
# split at columns 1 and 3
np.hsplit(a, [1, 3])


[array([[0],
        [4],
        [8]]), array([[ 1,  2],
        [ 5,  6],
        [ 9, 10]]), array([[ 3],
        [ 7],
        [11]])]

In [113]:
# np.split with axis=1 splits along the columns, like np.hsplit
np.split(a, 2, axis = 1)


[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

In [114]:
# new array for examples
a = np.arange(12).reshape(4, 3)
a

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [115]:
# split into four rows of arrays
np.vsplit(a, 4)

[array([[0, 1, 2]]),
 array([[3, 4, 5]]),
 array([[6, 7, 8]]),
 array([[ 9, 10, 11]])]

In [116]:
# into two rows of arrays
np.vsplit(a, 2)


[array([[0, 1, 2],
        [3, 4, 5]]), array([[ 6,  7,  8],
        [ 9, 10, 11]])]

In [117]:
# split along axis=0 at row indexes 1 and 3:
# row 0 of the original is the first array,
# rows 1 and 2 are the second, and row 3 is the third
np.vsplit(a, [1, 3])

[array([[0, 1, 2]]), array([[3, 4, 5],
        [6, 7, 8]]), array([[ 9, 10, 11]])]

In [119]:
# split can specify axis
np.split(a, 2, axis = 0)


[array([[0, 1, 2],
        [3, 4, 5]]), array([[ 6,  7,  8],
        [ 9, 10, 11]])]

In [120]:
# 3-d array
c = np.arange(27).reshape(3, 3, 3)
c

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

In [121]:
# split into 3
np.dsplit(c, 3)

[array([[[ 0],
         [ 3],
         [ 6]],
 
        [[ 9],
         [12],
         [15]],
 
        [[18],
         [21],
         [24]]]), array([[[ 1],
         [ 4],
         [ 7]],
 
        [[10],
         [13],
         [16]],
 
        [[19],
         [22],
         [25]]]), array([[[ 2],
         [ 5],
         [ 8]],
 
        [[11],
         [14],
         [17]],
 
        [[20],
         [23],
         [26]]])]

## Useful numerical methods of NumPy arrays

In [123]:
# demonstrate some of the properties of NumPy arrays
m = np.arange(10, 19).reshape(3, 3)
print(m)
print("{0} min of the entire matrix".format(m.min()))
print("{0} max of entire matrix".format(m.max()))
print("{0} position of the min value".format(m.argmin()))
print("{0} position of the max value".format(m.argmax()))
print("{0} mins down each column".format(m.min(axis = 0)))
print("{0} mins across each row".format(m.min(axis = 1)))
print("{0} maxs down each column".format(m.max(axis = 0)))
print("{0} maxs across each row".format(m.max(axis = 1)))

[[10 11 12]
 [13 14 15]
 [16 17 18]]
10 min of the entire matrix
18 max of entire matrix
0 position of the min value
8 position of the max value
[10 11 12] mins down each column
[10 13 16] mins across each row
[16 17 18] maxs down each column
[12 15 18] maxs across each row
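
The positions reported by `.argmin()` and `.argmax()` are indexes into the flattened array, which is why the maximum is reported at position 8 rather than at a row and column. One way to convert such a position back, sketched here with `np.unravel_index()`:

```python
import numpy as np

m = np.arange(10, 19).reshape(3, 3)

flat_position = m.argmax()  # index into the flattened array
row, col = np.unravel_index(flat_position, m.shape)
print(flat_position, row, col)  # 8 2 2

# with an axis argument, a position is returned per column or row
print(m.argmax(axis=0))  # row index of the max in each column
print(m.argmin(axis=1))  # column index of the min in each row
```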


In [124]:
# demonstrate included statistical methods
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [125]:
a.mean(), a.std(), a.var()

(4.5, 2.8722813232690143, 8.25)
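
By default, `.std()` and `.var()` divide by the number of elements (the population form). The `ddof` parameter changes the divisor to `N - ddof`, so `ddof=1` gives the sample form. A brief sketch:

```python
import numpy as np

a = np.arange(10)

population_var = a.var()    # divides by N
sample_var = a.var(ddof=1)  # divides by N - 1
sample_std = a.std(ddof=1)

print(population_var, sample_var, sample_std)
```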

In [126]:
a = np.arange(1, 6)
a

array([1, 2, 3, 4, 5])

In [127]:
a.sum(), a.prod()

(15, 120)

The cumulative sum and products can be computed with the .cumsum() and
.cumprod() methods:

In [128]:
# cumulative sum and product
a.cumsum(), a.cumprod()

(array([ 1,  3,  6, 10, 15], dtype=int32),
 array([  1,   2,   6,  24, 120], dtype=int32))
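
The last element of each cumulative result equals the corresponding total, and on a two-dimensional array the accumulation can run along either axis. A small sketch, using an illustrative 2x3 array:

```python
import numpy as np

a = np.arange(1, 6)
# the final running total matches .sum() and .prod()
assert a.cumsum()[-1] == a.sum()
assert a.cumprod()[-1] == a.prod()

m = np.arange(1, 7).reshape(2, 3)
down_columns = m.cumsum(axis=0)  # running totals down each column
across_rows = m.cumsum(axis=1)   # running totals across each row

print(down_columns)
print(across_rows)
```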

In [129]:
# applying logical operators
a = np.arange(10)
(a < 5).any() # any < 5?


True

In [130]:
(a < 5).all() # all < 5?

False

In [131]:
# size is always the total number of elements
np.arange(10).reshape(2, 5).size

10

In [132]:
# .ndim will give you the total # of dimensions
np.arange(10).reshape(2,5).ndim

2

# Summary
In this chapter, we have examined NumPy arrays to get an understanding of their
capabilities to manipulate data and perform operations on data including selecting
elements, vectorization, Boolean selection, reshaping, stacking, concatenation,
splitting, and slicing. NumPy has many other features, but these are the ones that
are important to understand as they will set a frame of reference for understanding
the operation of pandas Series and DataFrame objects. All the concepts covered in
this chapter will be examined in much more detail in the next two chapters, where
they are applied to pandas objects, which extend these capabilities to provide a much
richer and more expressive means of representing and manipulating data than is
offered with NumPy arrays.