## Chapter 4 NumPy Basics: Arrays and Vectorized Computation
For most data analysis applications, the main areas of functionality are:
* Fast vectorized array operations for data munging and cleaning, subsetting, and any other kinds of computations.
* Common array algorithms like sorting, unique, and set operations. 
* Efficient descriptive statistics and aggregating/summarizing data.
* Data alignment and relational data manipulations for merging and joining together heterogeneous datasets.
* Expressing conditional logic as array expressions instead of loops with if-elif-else branches.
* Group-wise data manipulations (aggregation, transformation, function application).
<br>
***
### 4.1 The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object (__ndarray__) which is a fast, flexible container for large datasets in Python.<br>
Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar<br> 
elements. <br>
<br>
To show how NumPy enables batch computations with syntax similar to that of scalar operations on built-in Python objects, I first import NumPy and generate a small array<br>
of random data:

In [1]:
#Example:
import numpy as np

#Generate some random data

data = np.random.randn(2,3)
data

array([[-0.27167476, -0.77877042, -0.02588317],
       [-1.66818944,  1.43918142, -1.82932455]])

I can then write mathematical operations with data:

In [2]:
data * 10

array([[ -2.7167476 ,  -7.78770419,  -0.25883171],
       [-16.68189437,  14.39181419, -18.29324553]])

In [3]:
data + data

array([[-0.54334952, -1.55754084, -0.05176634],
       [-3.33637887,  2.87836284, -3.65864911]])
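
As a small sketch of what "batch computation" means here, the following compares a vectorized expression with an equivalent pure-Python loop. The array `values` and its contents are made up for illustration (fixed numbers rather than random data, so the result is reproducible):

```python
import numpy as np

# Hypothetical fixed values, used instead of random data for reproducibility.
values = np.array([[1.5, -2.0, 0.25],
                   [4.0, 0.5, -1.0]])

# Vectorized: a single expression is applied to every element.
scaled = values * 10

# Equivalent pure-Python loop over nested lists, shown only for comparison.
scaled_loop = [[x * 10 for x in row] for row in values.tolist()]

print(scaled)
print(np.array_equal(scaled, np.array(scaled_loop)))
```

Both approaches should produce the same numbers; the vectorized form is shorter and avoids an explicit Python-level loop.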

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a __shape__, a tuple indicating the size of each dimension,<br>
and a dtype, an object describing the _data type_ of the array. 

In [4]:
display(data.shape)
display(data.dtype)

(2, 3)

dtype('float64')

#### Creating ndarrays
The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example,<br>
a list is a good candidate for conversion:

In [5]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [6]:
data2 = [[1,2,3,4],[5,6,7,8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array __arr2__ has two dimensions, with its shape inferred from the data. We can confirm this by inspecting the __ndim__ and __shape__ attributes:

In [7]:
display(arr2.ndim)
display(arr2.shape)

2

(2, 4)

__arange__ is an array-valued version of the built-in Python range function:

In [8]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
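
Besides `np.array` and `np.arange`, NumPy offers other creation functions. The sketch below shows `np.zeros` and `np.ones`, plus `np.arange` with explicit start, stop, and step arguments. The variable names are chosen only for this illustration:

```python
import numpy as np

zeros = np.zeros(5)           # 1-D array of five 0.0 values (float64 by default)
ones = np.ones((2, 3))        # 2 x 3 array of 1.0 values; the shape is passed as a tuple
evens = np.arange(0, 10, 2)   # start, stop (exclusive), step, mirroring range()

print(zeros)
print(ones)
print(evens)
```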

You can explicitly convert or _cast_ an array from one dtype to another using ndarray's __astype__ method:

In [9]:
arr= np.array([1,2,3,4,5])
display(arr.dtype)

float_arr = arr.astype(np.float64)
display(float_arr.dtype)

dtype('int32')

dtype('float64')

In [10]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
display(arr)
display(arr.astype(np.int32))

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

array([ 3, -1, -2,  0, 12, 10])
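
Note that in the output above the fractional part is dropped rather than rounded. If rounding is what you want, one option (a sketch, not the only approach) is to round first with `np.rint` and then cast:

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

# Casting float -> int discards the fractional part (truncation toward zero).
truncated = arr.astype(np.int32)

# Rounding to the nearest integer first, then casting.
# np.rint rounds exact halves to the nearest even value, so 0.5 becomes 0.
rounded = np.rint(arr).astype(np.int32)

print(truncated)
print(rounded)
```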

If you have an array of strings representing numbers, you can use __astype__ to convert them to numeric form:

In [11]:
numeric_strings = np.array(['1.25', '-9.6', '4.2'], dtype=np.bytes_)  # np.string_ was removed in NumPy 2.0; np.bytes_ is the same type
numeric_strings.astype(np.float64)

array([ 1.25, -9.6 ,  4.2 ])

#### Arithmetic with NumPy Arrays
Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this _vectorization_. Any arithmetic operation between equal-size arrays<br>
applies the operation element-wise:

In [12]:
arr = np.array([[1.,2.,3.], [4., 5.,6.]])
display(arr) 
display(arr*arr)
display(arr-arr)

array([[1., 2., 3.],
       [4., 5., 6.]])

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

array([[0., 0., 0.],
       [0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [13]:
display(1/arr)
display(arr**0.5)

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

Comparisons between arrays of the same size yield boolean arrays:

In [14]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
display(arr2)
arr2 > arr

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

array([[False,  True, False],
       [ True, False,  True]])

Evaluating operations between differently sized arrays is called _broadcasting_ and will be discussed in more detail in Appendix A.<br>
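
As a minimal preview of broadcasting, the sketch below adds a one-dimensional array of length 3 to a 2 x 3 array. The array `row` is a made-up example; NumPy stretches it across each row of `arr`:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)

# The trailing dimensions match (3 and 3), so `row` is applied to every row of `arr`.
result = arr + row

print(result)
```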
#### Basic Indexing and Slicing
One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [15]:
arr = np.arange(10)
display(arr)
display(arr[5])
display(arr[5:8])
arr[5:8] = 12
display(arr)


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

5

array([5, 6, 7])

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

When you assign a scalar value to a slice, as in arr[5:8] = 12, the value is _broadcast_ to the entire selection. An important distinction from Python's built-in lists is that array <br>
slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source<br>
array. For example:

In [16]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

Now, when I change values in __arr_slice__, the mutations are reflected in the original array __arr__:

In [17]:
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [18]:
arr_slice[1] = 12345
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The "bare" slice [:] will assign to all values in an array:

In [19]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
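
If you want an independent copy of a slice instead of a view, you can copy it explicitly with the `copy` method. The sketch below contrasts the two; the names `view` and `independent` are chosen for this illustration:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                 # a view: shares memory with arr
independent = arr[5:8].copy()   # an explicit copy: owns its own data

independent[:] = -1             # does not touch arr
view[:] = 64                    # writes through to arr

print(arr)
print(independent)
print(np.shares_memory(arr, view), np.shares_memory(arr, independent))
```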

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but <br>
rather one-dimensional arrays:

In [20]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]

array([7, 8, 9])

Individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select<br> 
individual elements.

In [21]:
arr2d[0][2]

3

In [22]:
arr2d[0, 2]

3

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 x 2 x 3 array arr3d:

In [23]:
arr3d = np.array([[[1,2,3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12,]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

arr3d[0] is a 2x3 array:

In [24]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to arr3d[0]:

In [30]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [31]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a one-dimensional array:

In [32]:
arr3d[1,0]

array([7, 8, 9])

This expression is the same as though we had indexed in two steps:

In [33]:
x = arr3d[1]
x

array([[ 7,  8,  9],
       [10, 11, 12]])

In [34]:
x[0]

array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views. <br>
#### Indexing with Slices
Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [35]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

In [36]:
arr[1:6]

array([ 1,  2,  3,  4, 64])

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

In [37]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [38]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

You can pass multiple slices just like you can pass multiple indexes:

In [39]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower<br>
dimensional slices.

In [40]:
arr2d[1, :2]

array([4, 5])

Of course, assigning to a slice expression assigns to the whole selection:

In [42]:
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])
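
A colon by itself takes the entire axis, which makes it possible to slice only the higher dimensional axes. The sketch below uses a fresh copy of the original 3 x 3 values (not the modified `arr2d` above) and contrasts a slice with an integer index on the column axis:

```python
import numpy as np

# Fresh array with the original values, so the example is self-contained.
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_col_2d = arr2d[:, :1]   # slice on both axes: result stays two-dimensional
first_col_1d = arr2d[:, 0]    # integer on axis 1: result drops to one dimension

print(first_col_2d.shape, first_col_2d.tolist())
print(first_col_1d.shape, first_col_1d.tolist())
```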

#### Boolean Indexing
Let's consider an example where we have some data in an array and an array of names with duplicates. Here I use the __randn__ function <br>
in __numpy.random__ to generate some random normally distributed data:

In [46]:
names = np.array(['Bob', 'Joe','Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7,4)
data

array([[ 0.75398668, -0.18909075, -0.67957596,  2.36640061],
       [-0.18587135, -1.11334457,  2.44613276,  0.3083002 ],
       [-0.50792444, -0.67220222, -0.65253411, -1.36195853],
       [-0.31842426,  0.14191011, -1.38408929, -0.78362687],
       [ 0.38259099,  0.34495639,  0.82344222, -0.80749555],
       [ 0.70298264, -0.18893257, -0.77990613, -0.5230792 ],
       [-0.26046985,  2.25560693,  0.03335464, -1.09221324]])

Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'. Like arithmetic <br>
operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:

In [47]:
names =='Bob'

array([ True, False, False,  True, False, False, False])

This boolean array can be passed when indexing the array:

In [48]:
data[names=='Bob']

array([[ 0.75398668, -0.18909075, -0.67957596,  2.36640061],
       [-0.31842426,  0.14191011, -1.38408929, -0.78362687]])

The boolean array must be the same length as the array axis it's indexing. You can even mix and match boolean arrays with slices or integers. <br>
<br>
In these examples, I select from the rows where names == 'Bob' and index the columns, too:

In [49]:
data[names == 'Bob', 2:]

array([[-0.67957596,  2.36640061],
       [-1.38408929, -0.78362687]])

In [50]:
data[names== 'Bob', 3]

array([ 2.36640061, -0.78362687])

To select everything but 'Bob', you can either use != or negate the condition using ~:

In [51]:
names != 'Bob'

array([False,  True,  True, False,  True,  True,  True])

In [52]:
data[~(names=='Bob')]

array([[-0.18587135, -1.11334457,  2.44613276,  0.3083002 ],
       [-0.50792444, -0.67220222, -0.65253411, -1.36195853],
       [ 0.38259099,  0.34495639,  0.82344222, -0.80749555],
       [ 0.70298264, -0.18893257, -0.77990613, -0.5230792 ],
       [-0.26046985,  2.25560693,  0.03335464, -1.09221324]])

The ~ operator can be useful when you want to invert a general condition. <br>
<br>
To select two of the three names by combining multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):

In [54]:
mask = (names == 'Bob') | (names == 'Will')
mask

array([ True, False,  True,  True,  True, False, False])

In [55]:
data[mask]

array([[ 0.75398668, -0.18909075, -0.67957596,  2.36640061],
       [-0.50792444, -0.67220222, -0.65253411, -1.36195853],
       [-0.31842426,  0.14191011, -1.38408929, -0.78362687],
       [ 0.38259099,  0.34495639,  0.82344222, -0.80749555]])
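
The Python keywords `and` and `or` do not work with boolean arrays; use `&` and `|` instead. The sketch below illustrates this with the same names array. The flag `keyword_failed` is a hypothetical helper used only to record what happens:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Element-wise OR. The parentheses are needed because | binds more tightly than ==.
mask = (names == 'Bob') | (names == 'Will')

# The `or` keyword asks for a single truth value of the left-hand array,
# which is ambiguous for an array with more than one element.
try:
    (names == 'Bob') or (names == 'Will')
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(mask)
print(keyword_failed)
```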

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged. <br>
<br>
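
To illustrate the copy behavior, here is a small sketch with made-up data (the arrays `data` and `keep` below are hypothetical and separate from the example above). Modifying the selected rows leaves the source array untouched:

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)
keep = np.array([True, False, True])

selected = data[keep]   # boolean indexing returns a new array, not a view
selected[:] = -1        # modifies only the copy

print(np.shares_memory(data, selected))
print(data)
```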
Setting values with boolean arrays works in a common-sense way. To set all of the negative values in data to 0 we need only do:

In [56]:
data[data<0] = 0
data

array([[0.75398668, 0.        , 0.        , 2.36640061],
       [0.        , 0.        , 2.44613276, 0.3083002 ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.14191011, 0.        , 0.        ],
       [0.38259099, 0.34495639, 0.82344222, 0.        ],
       [0.70298264, 0.        , 0.        , 0.        ],
       [0.        , 2.25560693, 0.03335464, 0.        ]])

Setting whole rows or columns using a one-dimensional boolean array is also easy:

In [58]:
data[names!= 'Joe'] = 7
data

array([[7.        , 7.        , 7.        , 7.        ],
       [0.        , 0.        , 2.44613276, 0.3083002 ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [0.70298264, 0.        , 0.        , 0.        ],
       [0.        , 2.25560693, 0.03335464, 0.        ]])

#### Fancy Indexing
_Fancy indexing_ is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 x 4 array:

In [60]:
arr = np.empty((8,4))
for i in range(8):
    arr[i] = i
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

To select a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

In [61]:
arr[[4, 3, 0, 6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

Using negative indices selects from the end:

In [63]:
arr[[-3, -5, -7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

Passing multiple index arrays does something slightly different; it selects a one dimensional array of elements corresponding to each tuple of<br>
indices:

In [64]:
arr = np.arange(32).reshape((8,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [65]:
arr[[1,5,7,2], [0,3,1,2]]

array([ 4, 23, 29, 10])

Regardless of how many dimensions the array has (here, only 2), the result of fancy indexing with multiple integer arrays is always one-dimensional. <br>
<br>
The behavior of fancy indexing in this case is a bit different from what some users might have expected, which is the rectangular region formed by<br>
selecting a subset of the matrix's rows and columns. Here is one way to get that:

In [70]:
arr[[1,5,7,2]][:,[0, 3, 1,2]]

array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
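
Another way to select the rectangular region is NumPy's `np.ix_` function, which turns the row and column index lists into an open mesh. The sketch below checks that it gives the same result as the two-step indexing; the names `rows` and `cols` are chosen for this illustration:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

two_step = arr[rows][:, cols]        # select rows first, then reorder columns
with_ix = arr[np.ix_(rows, cols)]    # select the rows x cols block in one step

print(with_ix)
print(np.array_equal(two_step, with_ix))
```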

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array. <br>
#### Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the <br>
transpose method and also the special T attribute:

In [75]:
arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [76]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you may do this very often - for example, when computing the inner matrix product using __np.dot__:

In [78]:
arr = np.random.randn(6,3)
arr

array([[-0.79171851,  0.67120852, -0.04288797],
       [-0.80064417,  0.50463608,  1.38480094],
       [-1.2293575 , -0.07048092,  0.21104724],
       [-0.05041581,  1.45442821,  0.59787966],
       [-0.58593549, -0.25584698,  0.27511317],
       [ 0.29493137,  0.82808258, -1.02521925]])

In [79]:
np.dot(arr.T,arr)

array([[ 3.21201581, -0.52798472, -1.82794058],
       [-0.52798472,  3.57668586,  0.60537892],
       [-1.82794058,  0.60537892,  3.4482758 ]])
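
The same product can also be written with the `@` infix operator. The sketch below uses a small fixed array (made up for reproducibility, instead of random data) and checks that both spellings agree and that the result is symmetric:

```python
import numpy as np

# Hypothetical 2 x 3 input with fixed values.
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

gram = np.dot(arr.T, arr)   # (3 x 2) times (2 x 3) gives a 3 x 3 result
same = arr.T @ arr          # the @ operator performs the same matrix multiplication

print(gram)
print(np.array_equal(gram, same), np.array_equal(gram, gram.T))
```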

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending):

In [91]:
arr = np.arange(16).reshape((2, 2, 4))
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [92]:
arr.transpose((1,0,2))

array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])
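
For swapping just two axes, ndarray also has the `swapaxes` method, which takes a pair of axis numbers. A minimal sketch on the same 2 x 2 x 4 array, which also checks that the result is a view and matches the equivalent `transpose` call:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

swapped = arr.swapaxes(1, 2)   # exchange axis 1 and axis 2: shape (2, 2, 4) -> (2, 4, 2)

print(swapped.shape)
print(swapped[0])
print(np.shares_memory(arr, swapped))                         # a view, no data copied
print(np.array_equal(swapped, arr.transpose((0, 2, 1))))      # same as permuting axes
```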