# Chapter 4

# NumPy Basics: Arrays and Vectorized Computation

***

NumPy is one of the most important packages for numerical computing in Python. It is an open source Python library that mainly provides a multidimensional array object to work with.



The array object in NumPy is called __ndarray__, and it comes with methods that operate on it efficiently. NumPy arrays are designed to be much faster for numerical work than Python's built-in sequences, such as lists.

To see the difference between a list and an array, consider the example below with one million integers.

In [1]:
import numpy as np

It is a standard practice to __import numpy as np__. 

In [2]:
a_arr = np.arange(1000000)

In [3]:
a_list = list(range(1000000))

Multiply each sequence by 2:

In [4]:
%time for _ in range(10): a_arr2 = a_arr*2

Wall time: 18.9 ms


In [5]:
%time for _ in range(10): a_list2 = [x * 2 for x in a_list]

Wall time: 851 ms


The timings above show a clear difference: in this run, the array version was roughly 45 times faster than the list version (18.9 ms versus 851 ms). This is why arrays are preferred for large datasets.
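The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used; the variable names `array_seconds` and `list_seconds` are illustrative, and the actual timings depend on the machine.

```python
import timeit

import numpy as np

n = 1_000_000
a_arr = np.arange(n)
a_list = list(range(n))

# Time 10 repetitions of each approach, mirroring the %time cells above.
array_seconds = timeit.timeit(lambda: a_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in a_list], number=10)

print(f"array: {array_seconds * 1000:.1f} ms")
print(f"list:  {list_seconds * 1000:.1f} ms")

# Both approaches compute the same values.
same_values = (a_arr * 2).tolist() == [x * 2 for x in a_list]
print(same_values)
```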

***

## 4.1 The NumPy ndarray: A Multidimensional Array Object

The N-dimensional array object, or ndarray for short, is one of the main features of NumPy. It is a fast, flexible container for large datasets in Python. Arrays allow you to perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements.

For example,

In [6]:
data = np.random.randn(2,3)

In [7]:
data

array([[-1.3614347 ,  0.76309189,  0.99520754],
       [-1.26694389, -1.76733928, -0.31894432]])

In [8]:
data * 2

array([[-2.7228694 ,  1.52618379,  1.99041507],
       [-2.53388779, -3.53467855, -0.63788864]])

In [9]:
data - data

array([[0., 0., 0.],
       [0., 0., 0.]])

The specified operations are applied to each element in the array.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:


In [10]:
data.shape

(2, 3)

In [11]:
data.dtype

dtype('float64')

### Creating ndarrays

The __array__ function is the most basic way to create an ndarray. It accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example,

In [12]:
data1 = [1 , 32, 45, 66, 87, 99]

In [13]:
arr1 = np.array(data1)

In [14]:
arr1

array([ 1, 32, 45, 66, 87, 99])

In [15]:
arr1.dtype

dtype('int32')

Similarly, we can create higher-dimensional arrays by passing nested sequences to the __array__ function.

In [16]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

In [17]:
arr2 = np.array(data2)

In [18]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Furthermore, the __ndim__ and __shape__ attributes can be used to check the number of dimensions and the shape of an array:

In [19]:
arr1.ndim

1

In [20]:
arr2.ndim

2

In [21]:
arr1.shape

(6,)

In [22]:
arr2.shape

(2, 4)

There are other ways to create new arrays. __zeros__ and __ones__ create arrays of 0s and 1s, respectively, with a given length or shape.

In [23]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [24]:
np.ones((3, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Another way is to use __empty__ which creates an array without initializing its values to any particular value. It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values. 

In [25]:
np.empty((2, 2))

array([[2.12199579e-314, 9.37922140e-312],
       [3.81418679e-321, 6.95250784e-310]])

__arange__ is an array-valued version of the built-in Python range function:

In [26]:
np.arange(5)

array([0, 1, 2, 3, 4])

***

### Data Types for ndarrays

The data type, or __dtype__, is a special object that holds the information the ndarray needs to interpret a chunk of memory as a particular type of data.

We can explicitly define the dtype when creating an array, but it is optional:

In [27]:
arr1 = np.array([1, 2, 3], dtype=np.float64)

In [28]:
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [29]:
arr1.dtype

dtype('float64')

In [30]:
arr2.dtype

dtype('int32')

We can explicitly convert, or cast, an array from one dtype to another using ndarray's __astype__ method:

In [31]:
arr = np.array([1, 2, 3, 4, 5])

In [32]:
arr.dtype

dtype('int32')

In [33]:
arr.astype(np.float64)

array([1., 2., 3., 4., 5.])

In [34]:
arr.dtype

dtype('int32')

<span style='color:red'>Note:</span> The __astype__ method returns a new array containing a copy of the converted data; it does not change the original array.

In [35]:
f_array = np.array([1.2, 3.2, 4.5, 1.1, 0.2])

In [36]:
f_array.dtype

dtype('float64')

In [37]:
f_array.astype(np.int32)

array([1, 3, 4, 1, 0])

Converting an array from a floating-point dtype to an integer dtype truncates the decimal part of every element.

Similarly, if an array of strings represents numbers, we can use __astype__ to convert them to numeric form:

In [38]:
numeric_strings = np.array(['2.3', '44.5', '3.22', '9.88', '55.11'], dtype=np.string_)

In [39]:
numeric_strings

array([b'2.3', b'44.5', b'3.22', b'9.88', b'55.11'], dtype='|S5')

In [40]:
numeric_strings.astype(float)

array([ 2.3 , 44.5 ,  3.22,  9.88, 55.11])

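<span style='color:red'>Note:</span> It’s important to be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data. Also note that numpy.string_ is an alias of numpy.bytes_ and was removed in NumPy 2.0, so newer versions need dtype=np.bytes_ instead.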

NumPy aliases the Python types to its own equivalent data types. This is why passing float rather than np.float64 did not raise an error.

Furthermore, if the casting fails for some reason (for example, a string that cannot be converted to float64), a ValueError is raised.
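The two pitfalls just described, silent truncation of fixed-size strings and a failed cast, can be sketched as follows. The arrays `short` and `bad` are made-up examples, and the exact wording of the error message may vary between NumPy versions.

```python
import numpy as np

# Fixed-size byte strings: 'S3' keeps at most 3 bytes per element,
# so longer inputs are truncated without any warning.
short = np.array(['12345', '67'], dtype='S3')
print(short)

# A string that does not represent a number cannot be cast to float.
bad = np.array(['1.5', 'abc'])
try:
    bad.astype(np.float64)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("cast failed:", exc)
```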

We can also assign the data type of one array to another. For example:

In [41]:
int_array = np.arange(10)

In [42]:
int_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [43]:
float_array = np.random.randn(5)

In [44]:
float_array

array([-0.36634617, -0.61533867,  0.58029466, -0.78510612, -1.13210423])

In [45]:
float_array.dtype

dtype('float64')

In [46]:
int_array.dtype

dtype('int32')

Now, convert the dtype of int_array to float64, using the dtype of float_array as the reference:

In [47]:
int_array.astype(float_array.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

***

### Arithmetic with NumPy Arrays

Arrays enable you to express batch operations on data without writing any for loops; this is called __vectorization__ in NumPy. Any arithmetic operation between equal-size arrays is applied element-wise:

In [48]:
array = np.arange(6).reshape((2, 3))

In [49]:
array

array([[0, 1, 2],
       [3, 4, 5]])

In [50]:
array + array

array([[ 0,  2,  4],
       [ 6,  8, 10]])

In [51]:
array * array

array([[ 0,  1,  4],
       [ 9, 16, 25]])

Arithmetic operations with scalars propagate the scalar argument to each element in
the array:

In [52]:
array / 2

array([[0. , 0.5, 1. ],
       [1.5, 2. , 2.5]])

Comparisons between arrays of the same size yield boolean arrays:

In [53]:
array2 = np.random.randn(6).reshape((2, 3))

In [54]:
array2

array([[-0.61283953, -1.47586485,  0.65819075],
       [-0.55513632,  0.49524572,  0.38387671]])

In [55]:
array2 > array

array([[False, False, False],
       [False, False, False]])

Operations between differently sized arrays are called __broadcasting__.
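As a small illustration of broadcasting, the sketch below adds a one-dimensional row and a single-column array to a 2 × 3 array. The names `row` and `col` are made up for this example.

```python
import numpy as np

array = np.arange(6).reshape((2, 3))   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,)
col = np.array([[100], [200]])         # shape (2, 1)

# The row is repeated for each of the 2 rows of `array`.
plus_row = array + row
print(plus_row)

# The column is repeated for each of the 3 columns of `array`.
plus_col = array + col
print(plus_col)
```

In both cases NumPy behaves as if the smaller array had been stretched to shape (2, 3), without actually copying its data.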

***

### Basic Indexing and Slicing

Indexing is a broad topic in NumPy, as there are many ways to select a subset of your data or individual elements. For 1-D arrays, indexing works much like it does for Python lists:

In [56]:
arr = np.arange(10)

In [57]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [58]:
arr[5]

5

In [59]:
arr[2:4]

array([2, 3])

If you assign a scalar value to a slice, the value is propagated to the entire selection:

In [60]:
arr[8:] = 7 

In [61]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 7, 7])

An important difference from Python lists is that array slices are views on the original array: the data is not copied, and any modification to the view is reflected in the source array. For example:

In [62]:
arr_slice = arr[8:]

In [63]:
arr_slice

array([7, 7])

Now, if the values of arr_slice are changed, the change is also visible in the source array.

In [64]:
arr_slice[:] = 8

In [65]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])

To get a copy instead of a view, call the __copy__ method explicitly on the slice.

For example,

In [66]:
arr_slice2 = arr[1:5].copy()

In [67]:
arr_slice2

array([1, 2, 3, 4])

In [68]:
arr_slice2[:] = 0

In [69]:
arr_slice2

array([0, 0, 0, 0])

In [70]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])

In this case, the original array is unaffected, because __copy__ was used to make an explicit copy instead of a view of the original array.
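When it is unclear whether an array is a view or a copy, one way to check is `np.shares_memory`, together with the `base` attribute. The sketch below uses the illustrative names `view` and `copy`.

```python
import numpy as np

arr = np.arange(10)
view = arr[1:5]          # a slice: shares memory with arr
copy = arr[1:5].copy()   # an explicit copy: owns its own data

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copy)
print(shares_view, shares_copy)

# A view keeps a reference to the array it was sliced from.
print(view.base is arr)
```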

Higher-dimensional arrays can be treated like nested Python lists, where the elements at each index are no longer scalars but lower-dimensional ndarrays.

##### In 2-D Array

In [71]:
arr2d = np.arange(1, 10).reshape((3, 3))

In [72]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [73]:
arr2d[1]

array([4, 5, 6])

In [74]:
arr2d[2][0]

7

In [75]:
arr2d[2, 2]

9

##### In 3-D Array

In [76]:
arr3d = np.arange(1, 13).reshape((2, 2, 3))

In [77]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [78]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to such a selection:

In [79]:
backup = arr3d[0].copy()

In [80]:
arr3d[0] = 0

In [81]:
arr3d

array([[[ 0,  0,  0],
        [ 0,  0,  0]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [82]:
arr3d[0] = backup

In [83]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, indexing with two integers returns a one-dimensional array:

In [84]:
arr3d[1, 0]

array([7, 8, 9])

<span style='color:red'>Note</span> that in all of these cases where subsections of the array have been selected, the returned arrays are views.

#### Indexing with slices

As mentioned, ndarrays can be sliced with syntax similar to Python lists. For example:

In [85]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])

In [86]:
arr[2:8]

array([2, 3, 4, 5, 6, 7])

In [87]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [88]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

For a 2-D array, the slice is taken along axis 0, the first axis. A slice therefore selects a range of elements along an axis.

We can pass multiple slices, just as we can pass multiple indexes, separated by commas.

In [89]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.

In [90]:
arr2d[1, :2]

array([4, 5])
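The effect of mixing integers and slices on the number of dimensions can be checked through the `shape` attribute. In this sketch, `row_1d` and `row_2d` are illustrative names.

```python
import numpy as np

arr2d = np.arange(1, 10).reshape((3, 3))

row_1d = arr2d[1, :2]     # integer index on axis 0: that axis is dropped
row_2d = arr2d[1:2, :2]   # slice on axis 0: the axis is kept with length 1

print(row_1d.shape)
print(row_2d.shape)
```

Both selections contain the values 4 and 5; only the number of dimensions differs.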

Similarly, assigning to a slice expression assigns to the whole selection:

In [91]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [92]:
arr2d[:2, :1] = 0

In [93]:
arr2d

array([[0, 2, 3],
       [0, 5, 6],
       [7, 8, 9]])

***

### Boolean Indexing

Let's set up an example with an array of names containing duplicate values and another array of random, normally distributed data generated with the __randn__ function in numpy.random.

In [94]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [95]:
data = np.random.randn(7, 4)

In [96]:
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [97]:
data

array([[-1.20490654,  2.03685471,  0.76180205, -0.38895965],
       [ 0.26760261, -0.24752945, -0.84425982,  1.66922996],
       [-0.26200389, -0.58179406, -2.17326023, -0.77225716],
       [-0.03685835,  0.04415635,  0.04193251, -0.01723994],
       [-0.47346649,  1.26775604, -0.32353586, -0.37955717],
       [-1.133134  , -0.92008706,  0.24985178,  2.11585902],
       [ 0.76948771, -1.06877757,  0.34356441, -0.0033556 ]])

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:

In [98]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

Now, this boolean array can be passed when indexing the data array:

In [99]:
data[names == 'Bob']

array([[-1.20490654,  2.03685471,  0.76180205, -0.38895965],
       [-0.03685835,  0.04415635,  0.04193251, -0.01723994]])

The boolean array must be of the same length as the array axis it's indexing.

<span style='color:red'> Note:</span> Older NumPy versions did not fail when the boolean array had the wrong length. Recent versions raise an IndexError in that case, so make sure the lengths match.

Furthermore, to select everything but 'Bob', we can use __!=__ or negate the condition using __~__:

In [100]:
names != 'Bob'

array([False,  True,  True, False,  True,  True,  True])

In [101]:
data[names != 'Bob']

array([[ 0.26760261, -0.24752945, -0.84425982,  1.66922996],
       [-0.26200389, -0.58179406, -2.17326023, -0.77225716],
       [-0.47346649,  1.26775604, -0.32353586, -0.37955717],
       [-1.133134  , -0.92008706,  0.24985178,  2.11585902],
       [ 0.76948771, -1.06877757,  0.34356441, -0.0033556 ]])

In [102]:
data[~(names == 'Bob')]

array([[ 0.26760261, -0.24752945, -0.84425982,  1.66922996],
       [-0.26200389, -0.58179406, -2.17326023, -0.77225716],
       [-0.47346649,  1.26775604, -0.32353586, -0.37955717],
       [-1.133134  , -0.92008706,  0.24985178,  2.11585902],
       [ 0.76948771, -1.06877757,  0.34356441, -0.0033556 ]])

The __~__ operator is most useful when the condition is stored in a variable and you want to invert it:

In [103]:
cond = names == 'Bob'

In [104]:
data[~cond]

array([[ 0.26760261, -0.24752945, -0.84425982,  1.66922996],
       [-0.26200389, -0.58179406, -2.17326023, -0.77225716],
       [-0.47346649,  1.26775604, -0.32353586, -0.37955717],
       [-1.133134  , -0.92008706,  0.24985178,  2.11585902],
       [ 0.76948771, -1.06877757,  0.34356441, -0.0033556 ]])

The & (and) and | (or) operators can be used to combine multiple boolean conditions:

In [105]:
mask = (names == 'Bob') | (names == 'Will')

In [106]:
mask

array([ True, False,  True,  True,  True, False, False])

In [107]:
data[mask]

array([[-1.20490654,  2.03685471,  0.76180205, -0.38895965],
       [-0.26200389, -0.58179406, -2.17326023, -0.77225716],
       [-0.03685835,  0.04415635,  0.04193251, -0.01723994],
       [-0.47346649,  1.26775604, -0.32353586, -0.37955717]])

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.

<span style='color:red'> Note:</span> The Python keywords __and__ and __or__ do not work with boolean arrays. Use the operators __&__ and __|__ instead.
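A short sketch of why the keywords fail: `or` asks Python for a single truth value of the whole array, which NumPy refuses to provide for arrays with more than one element. The names `is_bob` and `is_will` are illustrative.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
is_bob = names == 'Bob'
is_will = names == 'Will'

# Element-wise "or" works with the | operator.
mask = is_bob | is_will
print(mask)

# The keyword form raises ValueError because the truth value
# of a multi-element array is ambiguous.
try:
    is_bob or is_will
    keyword_failed = False
except ValueError as exc:
    keyword_failed = True
    print("keyword failed:", exc)
```

When the comparisons are written inline, each one needs its own parentheses, as in `(names == 'Bob') | (names == 'Will')`, because `|` and `&` bind more tightly than `==`.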

Setting values with boolean arrays:

In [108]:
data[data < 0] = 0

In [109]:
data

array([[0.        , 2.03685471, 0.76180205, 0.        ],
       [0.26760261, 0.        , 0.        , 1.66922996],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.04415635, 0.04193251, 0.        ],
       [0.        , 1.26775604, 0.        , 0.        ],
       [0.        , 0.        , 0.24985178, 2.11585902],
       [0.76948771, 0.        , 0.34356441, 0.        ]])

In [110]:
data[names != 'Joe'] = 7

In [111]:
data

array([[7.        , 7.        , 7.        , 7.        ],
       [0.26760261, 0.        , 0.        , 1.66922996],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [0.        , 0.        , 0.24985178, 2.11585902],
       [0.76948771, 0.        , 0.34356441, 0.        ]])

***

### Fancy Indexing

Fancy indexing is a term to describe indexing using integer arrays:

In [112]:
arr = np.empty((3, 4))

In [113]:
arr

array([[9.39762227e-312, 2.81617418e-322, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 8.75983079e+164, 4.71769982e-090,
        8.60677331e-067],
       [9.98595732e-048, 3.38045906e-057, 6.48224659e+170,
        4.93432906e+257]])

In [114]:
for i in range(3):
    arr[i] = i

In [115]:
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.]])

To select a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

In [116]:
arr[[1, 2, 0]]

array([[1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [0., 0., 0., 0.]])

In [117]:
arr[[-1, -2, -3]]

array([[2., 2., 2., 2.],
       [1., 1., 1., 1.],
       [0., 0., 0., 0.]])

Passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices:

In [118]:
arr = np.arange(32).reshape((8, 4))

In [119]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [120]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of how many dimensions the array has (here, only 2), the result of fancy indexing with as many index arrays as there are axes is always one-dimensional.

This behavior may differ from what you expect, which might be the rectangular region formed by selecting a subset of the matrix’s rows and columns. One way to get that region is:

In [121]:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])

Fancy indexing, unlike slicing, always copies the data into a new array. Hence, modifying the result does not affect the original array.
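As an alternative sketch for selecting a rectangular region, `np.ix_` builds index arrays that pick every combination of the given rows and columns. The example also checks that the result is a copy; `rows`, `cols`, `chained` and `region` are illustrative names.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

chained = arr[rows][:, cols]       # two-step selection, as shown earlier
region = arr[np.ix_(rows, cols)]   # the same region in a single step
same_region = np.array_equal(chained, region)
print(same_region)

# The result is a copy: changing it leaves arr untouched.
region[0, 0] = -1
print(arr[1, 0])
```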

***

### Transposing Arrays and Swapping Axes

Transposing swaps the rows and columns of a matrix.

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the __transpose__ method and also the special __T__ attribute:

In [122]:
arr = np.arange(15).reshape((3, 5))

In [123]:
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [124]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

The transpose is often used in matrix computations, for example when computing the inner matrix product of an array with itself using __np.dot__:

In [125]:
arr = np.random.randn(6, 3)

In [126]:
arr

array([[-0.45726237, -1.09278011,  0.14098256],
       [-0.14340008,  1.02921547, -0.05027271],
       [ 2.11683801,  0.54280933, -0.03433611],
       [ 0.83396711,  1.09722374, -0.17766716],
       [ 0.7101196 ,  0.01820362, -1.13591547],
       [-0.19611839,  0.27647013,  0.07863738]])

In [127]:
np.dot(arr.T, arr)

array([[ 5.94888905,  2.37489145, -1.10016753],
       [ 2.37489145,  3.82876187, -0.41831986],
       [-1.10016753, -0.41831986,  1.35163581]])

For higher dimensional arrays, __transpose__ will accept a tuple of axis numbers to permute the axes:

In [128]:
arr = np.arange(16).reshape((2, 2, 4))

In [129]:
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [130]:
arr.transpose((1, 0, 2)) # Note: a higher-dimensional array has multiple axes; a 3-D array has 3 axes.

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged.

ndarray has the method __swapaxes__, which takes a pair of axis numbers and switches the indicated axes to rearrange the data. __swapaxes__ similarly returns a view on the data without making a copy:

In [131]:
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [132]:
arr.swapaxes(1, 2)

array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])
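The relationship between `swapaxes` and `transpose`, and the fact that both return views, can be sketched as follows. The names `swapped` and `permuted` are illustrative.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)          # exchange axes 1 and 2
permuted = arr.transpose((0, 2, 1))   # the same permutation written out in full
print(swapped.shape)

same_result = np.array_equal(swapped, permuted)
is_view = np.shares_memory(arr, swapped)
print(same_result, is_view)

# Writing through the view changes the original array:
# swapped[0, 3, 1] refers to the same element as arr[0, 1, 3].
swapped[0, 3, 1] = -1
print(arr[0, 1, 3])
```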

***

## 4.2 Universal Functions: Fast Element-Wise Array Functions

Universal functions, or __ufuncs__ for short, are functions that perform element-wise operations on data in ndarrays. Many ufuncs are simple element-wise transformations, like __sqrt__ or __exp__.

In [133]:
arr = np.arange(10)

In [134]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [135]:
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [136]:
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

These are unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result.

In [137]:
x = np.random.randn(8)

In [138]:
y = np.random.randn(8)

In [139]:
x

array([-0.15166587,  0.14137693, -1.17602115,  1.04017828,  0.26283333,
        0.37110523, -2.34766472,  0.03997306])

In [140]:
y

array([-0.50723658,  1.185512  ,  1.43338552, -0.32959159, -1.13167902,
       -0.88264188,  0.39771944,  2.18583608])

In [141]:
np.maximum(x, y)

array([-0.15166587,  1.185512  ,  1.43338552,  1.04017828,  0.26283333,
        0.37110523,  0.39771944,  2.18583608])

Here, numpy.maximum computed the element-wise maximum of the elements in x and y.

Some ufuncs, like __modf__, return multiple arrays. __modf__ is a vectorized version of the built-in Python __divmod__; it returns the fractional and integral parts of a floating-point array:

In [142]:
arr = np.random.rand(5) * 5

In [143]:
arr

array([2.89431831, 1.50857355, 0.20620713, 1.80520653, 2.34562432])

In [144]:
remainder, whole_part = np.modf(arr)

In [145]:
remainder

array([0.89431831, 0.50857355, 0.20620713, 0.80520653, 0.34562432])

In [146]:
whole_part

array([2., 1., 0., 1., 2.])
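The comparison with `divmod` holds for non-negative values. For negative values the two differ, because `modf` gives both parts the sign of the input while `divmod` rounds the quotient down. A small sketch, with made-up input values:

```python
import numpy as np

values = np.array([2.5, -2.5])
frac, whole = np.modf(values)
print(frac)    # fractional parts keep the sign of the input
print(whole)   # integral parts, rounded toward zero

# The built-in divmod floors the quotient, so the remainder is non-negative here.
quotient, remainder = divmod(-2.5, 1)
print(quotient, remainder)
```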

Furthermore, ufuncs accept an optional out argument that allows them to operate in-place on arrays:

In [147]:
arr

array([2.89431831, 1.50857355, 0.20620713, 1.80520653, 2.34562432])

In [148]:
np.sqrt(arr)

array([1.70126962, 1.22824002, 0.45410035, 1.34357974, 1.53154312])

In [149]:
np.sqrt(arr, arr)

array([1.70126962, 1.22824002, 0.45410035, 1.34357974, 1.53154312])

In [150]:
arr

array([1.70126962, 1.22824002, 0.45410035, 1.34357974, 1.53154312])
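The second positional argument in `np.sqrt(arr, arr)` is the output array. As a sketch, the same call can be written with the `out` keyword, which makes the intent easier to read; the sample values are chosen so the results are exact.

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0])

# Write the square roots back into arr instead of allocating a new array.
result = np.sqrt(arr, out=arr)

print(arr)
print(result is arr)   # the ufunc returns the array passed as `out`
```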

***

## 4.3 Array-Oriented Programming with Arrays

NumPy arrays let you express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is known as vectorization.

### Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression __x if condition else y__. For example:

In [151]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])

In [152]:
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])

In [153]:
cond = np.array([True, False, True, True, False])

Suppose we want to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this looks like:

In [154]:
result = [(x if c else y)
         for x, y, c in zip(xarr, yarr, cond)]

In [155]:
result

[1.1, 2.2, 1.3, 1.4, 2.5]

This is very slow for large arrays, because all the work is done in interpreted Python code, and it does not work with multidimensional arrays. With __np.where__ it can be written concisely:

In [156]:
result = np.where(cond, xarr, yarr)

In [157]:
result

array([1.1, 2.2, 1.3, 1.4, 2.5])

The second and third arguments to np.where don’t need to be arrays; one or both of them can be scalars. A typical use of where in data analysis is to produce a new array of values based on another array.

For instance, suppose we have a matrix of randomly generated data and we want to replace all the positive values with 2 and all the negative values with -2. This is easy to do with __np.where__:

In [158]:
arr = np.random.randn(4, 4)

In [159]:
arr

array([[-0.78516965,  0.58060301,  0.01626919, -0.01073783],
       [ 1.27760903, -1.85373496, -0.32469633,  0.51570682],
       [ 1.27020228, -0.63041758,  1.53888679, -1.67593585],
       [-0.31828259, -0.47848203,  0.48667744, -0.44150042]])

In [160]:
np.where(arr>0, 2, -2)

array([[-2,  2,  2, -2],
       [ 2, -2, -2,  2],
       [ 2, -2,  2, -2],
       [-2, -2,  2, -2]])

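The arguments passed to np.where do not all have to be equal-sized arrays or all scalars; scalars and arrays can be mixed, for example to replace only the positive values with 2 and leave the rest unchanged.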
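A minimal sketch of mixing a scalar and an array in `np.where`, using a small made-up array instead of random data so that the result is easy to verify:

```python
import numpy as np

arr = np.array([[-0.5, 1.5],
                [2.0, -3.0]])

# Positive values become 2; all other values are taken from arr itself.
result = np.where(arr > 0, 2, arr)
print(result)
```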

***

### Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class.

We can use aggregations (often called reductions) like __sum__, __mean__ and __std__ either by calling the array instance method or by using the top-level NumPy function.

For example:

In [168]:
arr = np.arange(9).reshape((3, 3))

In [169]:
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [170]:
arr.mean()

4.0

In [171]:
np.mean(arr)

4.0

In [172]:
arr.sum()

36

Functions like mean and sum take an optional __axis__ argument that computes the statistic over the given axis, resulting in an array with one fewer dimension:

In [173]:
arr.mean(axis=1)

array([1., 4., 7.])

In [174]:
arr.sum(axis=1)

 # axis=1 sums across the columns, giving one value per row

array([ 3, 12, 21])

In [175]:
arr.sum(axis=0)

 # axis=0 sums down the rows, giving one value per column

array([ 9, 12, 15])
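If the reduced axis should be kept with length 1, for example so that the result still lines up with the original array, reductions such as `sum` and `mean` accept a `keepdims` argument. This is a sketch going slightly beyond the text above; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(9).reshape((3, 3))

row_sums = arr.sum(axis=1)                    # shape (3,): one value per row
row_sums_2d = arr.sum(axis=1, keepdims=True)  # shape (3, 1): axis 1 is kept
col_means = arr.mean(axis=0)                  # shape (3,): one value per column

print(row_sums.shape, row_sums_2d.shape)
print(col_means)
```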

Methods like __cumsum__ and __cumprod__ do not aggregate; instead they produce an array of the intermediate results:

In [176]:
arr = np.arange(5)

In [177]:
arr

array([0, 1, 2, 3, 4])

In [179]:
arr.cumsum()

array([ 0,  1,  3,  6, 10], dtype=int32)

In [180]:
arr.cumprod()

array([0, 0, 0, 0, 0], dtype=int32)

In [181]:
arr = np.arange(9).reshape((3, 3))

In [186]:
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [184]:
arr.cumsum(axis=0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]], dtype=int32)

In [185]:
arr.cumprod(axis=1)

array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]], dtype=int32)

***

### Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, __sum__ is often used as a means of counting True values in a boolean array:

In [188]:
arr = np.random.randn(100)

In [193]:
(arr > 0).sum()

55

There are two more methods, __any__ and __all__, which are especially useful for boolean arrays. __any__ tests whether one or more values in an array are True, while __all__ checks whether every value is True:

In [194]:
bool_arr = np.array([False, True, True, False, False, True])

In [195]:
bool_arr.any()

True

In [197]:
bool_arr.all()

False

These methods also work with non-boolean arrays, where non-zero elements are treated as True and zero elements as False.


***

### Sorting

Like Python lists, NumPy arrays can be sorted in-place with the __sort__ method.

In [201]:
arr = np.random.randn(5)

In [202]:
arr

array([-0.84108824, -0.68606056, -0.98741838, -0.6465286 , -0.97031332])

In [203]:
arr.sort()

In [204]:
arr

array([-0.98741838, -0.97031332, -0.84108824, -0.68606056, -0.6465286 ])

Furthermore, each one-dimensional section of values in a multidimensional array can be sorted by passing the axis number.

In [213]:
arr = np.random.randn(3, 3)

In [214]:
arr

array([[-0.14269263,  1.61678322, -0.56882944],
       [-0.90417091,  0.01941304, -0.71689748],
       [ 0.03271841,  1.44779528,  0.49320639]])

In [215]:
arr.sort(1) # sort along axis 1, that is, sort the values within each row

In [216]:
arr

array([[-0.56882944, -0.14269263,  1.61678322],
       [-0.90417091, -0.71689748,  0.01941304],
       [ 0.03271841,  0.49320639,  1.44779528]])
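The `sort` method changes the array it is called on. When the original order must be preserved, the top-level function `np.sort` returns a sorted copy instead. A small sketch with made-up values:

```python
import numpy as np

arr = np.array([3, 1, 2])

sorted_copy = np.sort(arr)   # returns a new, sorted array
print(arr)                   # the original order is unchanged
print(sorted_copy)

arr.sort()                   # sorts arr itself and returns None
print(arr)
```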

***

### Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. A commonly
used one is __np.unique__, which returns the sorted unique values in an array:

In [217]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [218]:
np.unique(names)

array(['Bob', 'Joe', 'Will'], dtype='<U4')

In [219]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])

In [220]:
np.unique(ints)

array([1, 2, 3, 4])

The pure python alternative for __np.unique__ would be:

In [221]:
sorted(set(names))

['Bob', 'Joe', 'Will']

Another function, np.in1d, tests membership of the values in one array in another, returning a boolean array (np.isin is the newer equivalent, and np.in1d is deprecated as of NumPy 2.0):

In [222]:
values = np.array([1, 2, 3, 4 , 5, 6])

In [223]:
np.in1d(values, [1, 9, 3])

array([ True, False,  True, False, False, False])
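A few of the other set functions can be sketched the same way. The arrays `x` and `y` below are made-up inputs; each function returns sorted, unique values, except `np.isin`, which returns a boolean mask.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 9, 3])

common = np.intersect1d(x, y)   # values present in both arrays
combined = np.union1d(x, y)     # values present in either array
only_x = np.setdiff1d(x, y)     # values in x that are not in y
member = np.isin(x, y)          # element-wise membership test

print(common, combined, only_x, member)
```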

***

## 4.4 File Input and Output with Arrays

In NumPy, array data can be saved to and loaded from disk in either text or binary format.

__np.save__ and __np.load__ are the two main functions for saving and loading the array data on disk respectively. Arrays are saved by default in an uncompressed raw binary format with file extension .npy:

In [224]:
arr = np.arange(10)

In [225]:
np.save('some_array', arr)

If the file path does not already end in .npy, the extension is appended automatically.

The array saved can now be loaded with __np.load__:

In [226]:
np.load('some_array.npy')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can save multiple arrays in an uncompressed archive using __np.savez__ and passing the arrays as keyword arguments:

In [228]:
brr = np.arange(5)

In [229]:
np.savez('array_archieve.npz', a=arr, b=brr)

Now, when loading an .npz file, we get back a dict-like object that loads the individual
arrays lazily:

In [230]:
arch = np.load('array_archieve.npz')

In [231]:
arch['a']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [232]:
arch['b']

array([0, 1, 2, 3, 4])

Furthermore, if the data compresses well, we can use __np.savez_compressed__ to save multiple arrays in a compressed archive:

In [233]:
np.savez_compressed('array_archieve.npz', a=arr, b=brr)
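The following sketch runs a complete save-and-load round trip in a temporary directory, so that no files are left behind. The file name `example_archive.npz` is made up for this example, and the `with` block around `np.load` closes the archive once the arrays have been read.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)
brr = np.arange(5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_archive.npz')
    np.savez_compressed(path, a=arr, b=brr)

    # The archive behaves like a dict; `files` lists the stored names.
    with np.load(path) as arch:
        stored_names = sorted(arch.files)
        a_loaded = arch['a']
        b_loaded = arch['b']

print(stored_names)
print(a_loaded, b_loaded)
```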

***

## 4.5 Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and other square matrix math, is an important part of any array library. Unlike some languages like MATLAB, multiplying two two-dimensional arrays with * is an element-wise product instead of a matrix dot product. Thus, there is a function __dot__, both an array method and a function in the numpy namespace, for matrix multiplication:

In [234]:
x = np.arange(6).reshape((2, 3))

In [239]:
y = np.arange(6, 15).reshape((3, 3))

In [240]:
x

array([[0, 1, 2],
       [3, 4, 5]])

In [241]:
y

array([[ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [242]:
x.dot(y)

array([[ 33,  36,  39],
       [114, 126, 138]])

x.dot(y) is equivalent to np.dot(x, y):

In [243]:
np.dot(x, y)

array([[ 33,  36,  39],
       [114, 126, 138]])

The __@__ symbol also works as an infix operator that performs matrix multiplication:

In [244]:
x @ y

array([[ 33,  36,  39],
       [114, 126, 138]])

A matrix product between a two-dimensional array and a suitably sized one-dimensional array results in a one-dimensional array:

In [245]:
x @ np.ones(3)

array([ 3., 12.])

__numpy.linalg__ has a standard set of matrix decompositions and things like inverse and determinant:

In [246]:
from numpy.linalg import inv, qr

In [255]:
x = np.random.randn(3, 3)

In [256]:
mat = x.T.dot(x)

The expression x.T.dot(x) computes the dot product of the transpose x.T with x.

In [257]:
inv(mat)

array([[10.98388302, -3.12225959,  4.37768768],
       [-3.12225959,  1.6278065 , -0.5970527 ],
       [ 4.37768768, -0.5970527 ,  3.71697247]])

In [258]:
mat.dot(inv(mat))

array([[ 1.00000000e+00,  1.42255327e-17, -2.00578373e-16],
       [-2.73236955e-16,  1.00000000e+00,  6.30109707e-16],
       [ 8.00558781e-16, -1.00544272e-16,  1.00000000e+00]])

In [259]:
q, r = qr(mat)

In [260]:
r

array([[-1.03839883, -2.08767826,  1.00687481],
       [ 0.        , -0.48631038, -0.24594352],
       [ 0.        ,  0.        ,  0.17319684]])
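Because of floating-point rounding, a product such as `mat.dot(inv(mat))` is only approximately the identity matrix, as the tiny off-diagonal values above show. One way to check such results is `np.allclose`. The sketch below uses a seeded generator (the seed 0 is an arbitrary choice) so that it is repeatable.

```python
import numpy as np
from numpy.linalg import inv, qr

rng = np.random.RandomState(0)   # arbitrary seed, for a repeatable example
x = rng.randn(3, 3)
mat = x.T @ x

q, r = qr(mat)

# Compare with a tolerance instead of testing for exact equality.
inverse_ok = np.allclose(mat @ inv(mat), np.eye(3))
qr_ok = np.allclose(q @ r, mat)               # q and r reproduce mat
orthogonal_ok = np.allclose(q.T @ q, np.eye(3))
upper_ok = np.allclose(r, np.triu(r))         # r is upper triangular

print(inverse_ok, qr_ok, orthogonal_ok, upper_ok)
```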

***

## 4.6 Pseudorandom Number Generation

The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.

For example, you can get a 3 × 3 array of samples from the standard normal distribution using __normal__:

In [262]:
samples = np.random.normal(size=(3, 3))

In [263]:
samples

array([[ 1.40914623,  0.12929614,  0.37196787],
       [ 0.1364388 , -2.02587074,  0.22613698],
       [-0.10566312, -2.51487788,  0.10090802]])

These are called pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the __seed__ of the random number generator. You can change NumPy’s random number generation seed using
__np.random.seed__:

In [264]:
np.random.seed(1234)

The data generation functions in __numpy.random__ use a global random seed. To avoid global state, you can use __numpy.random.RandomState__ to create a random number generator isolated from others:

In [265]:
rng = np.random.RandomState(1234)

In [266]:
rng.randn(5)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873])
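As a sketch of what seeding provides, two generators created with the same seed produce identical sample streams. The example also shows `np.random.default_rng`, the generator interface added in more recent NumPy versions; it is not covered in the text above, and it draws different numbers from `RandomState` even with the same seed.

```python
import numpy as np

# Two isolated generators with the same seed yield the same numbers.
rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)
same_stream = np.array_equal(rng_a.randn(5), rng_b.randn(5))
print(same_stream)

# The newer Generator interface (assumes a NumPy version that provides default_rng).
gen = np.random.default_rng(1234)
samples = gen.standard_normal(size=(3, 3))
print(samples.shape)
```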