# Chapter 4. NumPy Basics: Arrays and Vectorized Computation
McKinney Chapter 4: https://learning.oreilly.com/library/view/python-for-data/9781098104023/ch04.html

# NumPy
One of the key features of NumPy is its **N-dimensional array object**, or ndarray, which is a fast, flexible container for large data sets in Python.  All elements must be of the same type.
  
NumPy, short for **Numerical Python**, is one of the most important foundational packages for numerical computing in Python. 
  
NumPy is so important for numerical computations in Python partly because it is **designed for efficiency on large arrays of data**. There are a number of reasons for this:

- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays **without the need for Python for loops**, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.
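
The speed difference described above can be checked with a rough timing sketch. The array size and the variable names (`py_list`, `np_arr`) are illustrative assumptions, and the measured times will vary from machine to machine, so no specific numbers are claimed here.

```python
import time
import numpy as np

n = 1_000_000                 # illustrative size, chosen arbitrarily
py_list = list(range(n))      # built-in Python list
np_arr = np.arange(n)         # equivalent NumPy array

# Double every element with an explicit Python-level loop
start = time.perf_counter()
doubled_list = [x * 2 for x in py_list]
list_seconds = time.perf_counter() - start

# Double every element with one vectorized NumPy operation
start = time.perf_counter()
doubled_arr = np_arr * 2
array_seconds = time.perf_counter() - start

print(f"list comprehension: {list_seconds:.4f} s")
print(f"NumPy array:        {array_seconds:.4f} s")
```

On typical hardware the NumPy version is expected to be noticeably faster, but a single timing like this is only a rough indication.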

In [51]:
import numpy as np

## Creating ndarrays
### Create an array from a list of strings

In [52]:
data1 = ['a', 'b', 'c']

np.array(data1)  # array is a function

array(['a', 'b', 'c'], dtype='<U1')

### An array of 5 zeros

In [53]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

### An array of 5 ones

In [54]:
np.ones(5)

array([1., 1., 1., 1., 1.])

### An empty array
Values are not initialized to any particular value.
  
It’s not safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero “garbage” values. You should use this function only if you intend to populate the new array with data.

In [55]:
np.empty(5)

array([1., 1., 1., 1., 1.])
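
The ones shown above are leftover memory contents, not a guaranteed result. A minimal sketch of the intended usage pattern, in which every element is overwritten before it is read (the name `squares` is an illustrative assumption):

```python
import numpy as np

squares = np.empty(5)              # contents are arbitrary at this point
squares[:] = np.arange(5) ** 2     # overwrite every element before reading any of them

print(squares)  # [ 0.  1.  4.  9. 16.]
```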

### <i>arange()</i> function
numpy.arange is an array-valued version of the built-in Python range function.

In [56]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [57]:
np.arange(3, 10, 2)  # start, stop (exclusive), step

array([3, 5, 7, 9])

### <i>linspace()</i> function 

In [58]:
np.linspace(0, 6, 4)  # start, stop (inclusive), number of points

array([0., 2., 4., 6.])
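
A short side-by-side sketch of the two functions: `arange` takes a step and leaves out the stop value, while `linspace` takes a number of points and, by default, includes the stop value. The variable names are illustrative.

```python
import numpy as np

half_open = np.arange(0, 6, 2)    # step of 2; the stop value 6 is excluded
closed = np.linspace(0, 6, 4)     # 4 evenly spaced points; the stop value 6 is included

print(half_open)  # [0 2 4]
print(closed)     # [0. 2. 4. 6.]
```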

## Two-dimensional arrays
### Create an array from a list of equal-length lists.

In [59]:
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

### Display the number of dimensions with <i>ndim</i>


In [60]:
arr.ndim

2

### The shape of an array indicates the size of each dimension

In [61]:
arr.shape

(2, 4)

### The datatype of our array
Unless explicitly specified, <i>np.array</i> tries to infer a good data type for the array that it creates. <br/>
All datatypes: https://docs.scipy.org/doc/numpy/user/basics.types.html

In [62]:
arr.dtype

dtype('int64')
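
The inferred integer type above can depend on the platform. A minimal sketch of specifying the data type explicitly, either with the `dtype` argument at creation time or with `astype` afterwards (the variable names are illustrative):

```python
import numpy as np

as_float = np.array([1, 2, 3], dtype=np.float64)  # request a dtype at creation time
as_int = as_float.astype(np.int32)                # astype returns a new, converted array

print(as_float.dtype)  # float64
print(as_int.dtype)    # int32
```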

## Three-dimensional arrays
### 2 arrays, 3 rows and 2 columns

In [63]:
arr3d = np.empty((2, 3, 2))
arr3d

array([[[0.00000000e+000, 1.97626258e-323],
        [0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 3.69776220e-062]],

       [[5.88972302e-091, 5.99520910e-066],
        [2.34059661e-057, 2.46613527e+184],
        [3.99910963e+252, 1.46030983e-319]]])

In [64]:
arr3d.ndim

3

In [65]:
arr3d.shape

(2, 3, 2)

### Create a three-dimensional array: 10 arrays, 3 rows and 2 columns:

In [66]:
my_arr = np.zeros((10, 3, 2))
my_arr

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

In [67]:
my_arr.ndim

3

In [68]:
my_arr.shape

(10, 3, 2)
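
One way to read the shape `(10, 3, 2)` is that indexing the first axis yields one of the ten 3 x 2 arrays. A small sketch, using the hypothetical name `stack` so that it does not clash with the arrays above:

```python
import numpy as np

stack = np.zeros((10, 3, 2))

print(stack[0].shape)     # (3, 2): one of the 10 two-dimensional arrays
print(stack[0, 1].shape)  # (2,):   one row of that array
print(stack.size)         # 60:     total number of elements, 10 * 3 * 2
```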

Unless explicitly specified (discussed in “Data Types for ndarrays”), numpy.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object.

In [100]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1.dtype

dtype('float64')

In [101]:
arr1

array([6. , 7.5, 8. , 0. , 1. ])

## Operations between Arrays and Scalars
Arithmetic operations with scalars propagate the value to each element.

### Vectorization
Vectorization is the ability to perform batch operations on data without writing any for loops.



In [70]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

All of the elements are multiplied by two.

In [71]:
arr * 2

array([[ 2,  4,  6],
       [ 8, 10, 12]])
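
Other arithmetic operators behave the same way with a scalar. A brief sketch with division and exponentiation (the names `reciprocal` and `squared` are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

reciprocal = 1 / arr   # the scalar 1 is divided by every element; the result is float64
squared = arr ** 2     # every element is raised to the power 2

print(reciprocal.dtype)  # float64
print(squared)
```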

Any arithmetic operations between equal-size arrays apply the operation element-wise.
  
The corresponding values in each “cell” in the array have been multiplied by each other.

In [72]:
arr * arr

array([[ 1,  4,  9],
       [16, 25, 36]])

In [73]:
arr - arr

array([[0, 0, 0],
       [0, 0, 0]])

Comparisons between arrays of the same size yield Boolean arrays:

In [74]:
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [75]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [76]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

## Array Basic Indexing and Slicing
Array slices are VIEWS on the original array.

In [77]:
my_arr = np.arange(10)
my_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### On the surface, arrays work similarly to lists:

In [78]:
my_arr[2]

2

In [79]:
my_arr[2:5]

array([2, 3, 4])

The following example differs from list behavior: an array slice is a view rather than a copy, so modifying the slice modifies the original source array.  
  
NumPy has been designed with large data use cases in mind.  Too much copying might result in performance and memory problems.  
   
To explicitly copy a slice of an array, use the syntax  

```python
arr[5:8].copy()
```

Assign a scalar to a slice of an array.  The value will be propagated to the entire selection.

In [80]:
my_arr[2:5] = 100

The original array is modified:

In [81]:
my_arr

array([  0,   1, 100, 100, 100,   5,   6,   7,   8,   9])
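
A small sketch contrasting a view with an explicit copy. The names `base`, `view`, and `copied` are illustrative, and `np.shares_memory` is used only to confirm which arrays share the same underlying data.

```python
import numpy as np

base = np.arange(10)
view = base[5:8]             # a view: shares memory with base
copied = base[5:8].copy()    # an independent copy

view[:] = -1                 # writes through to base
copied[:] = 99               # leaves base untouched

print(base)                             # [ 0  1  2  3  4 -1 -1 -1  8  9]
print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copied))   # False
```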

### Two-dimensional array slicing and indexing
Nested sequences, like a list of equal-length lists, are converted into a multidimensional array.
  
In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays.

In [82]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

The element at each index is not a scalar, but a one-dimensional array:

In [83]:
arr2d[0]

array([1, 2, 3])

Two ways to select an individual element:

In [84]:
arr2d[0][1]

2

In [85]:
arr2d[0, 1]

2

#### Indexing elements in a NumPy array
<table>
<tr><td>0, 0</td><td>0, 1</td><td>0, 2</td></tr>
<tr><td>1, 0</td><td>1, 1</td><td>1, 2</td></tr>
<tr><td>2, 0</td><td>2, 1</td><td>2, 2</td></tr>
</table>

## Indexing with slices

### Slicing a two-dimensional array
  
A slice selects a range of elements along an axis.

In [86]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

<b>Select the following slice:</b><br/>
- rows with index 0 and 1 (2 is not included)
- columns with index 1 and 2

In [87]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

<b>Select the following slice:</b> <br/>
- row with index 1
- columns with index 0 and 1

In [88]:
arr2d[1, :2]

array([4, 5])

<b>Select the following slice:</b>
- row with index 2
- column with index 0

In [89]:
arr2d[2, :1]

array([7])

<b>Select the following slice:</b>
- all rows
- column with index 0

In [90]:
arr2d[:, :1]

array([[1],
       [4],
       [7]])
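
Note that slicing a column with `:1` keeps the result two-dimensional, while indexing it with the integer `0` drops that axis. A short sketch of the difference (variable names are illustrative):

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

column_slice = arr2d[:, :1]   # a slice on axis 1 keeps both dimensions
column_index = arr2d[:, 0]    # an integer on axis 1 removes that dimension

print(column_slice.shape)  # (3, 1)
print(column_index.shape)  # (3,)
```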

## Boolean Indexing
<i>names</i> is an array of names with duplicates 

In [91]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

Comparing <i>names</i> with the string 'Bob' yields a boolean array:

In [92]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

<i>data</i> is normally distributed random data.

In [93]:
data = np.random.randn(7, 4)
data

array([[ 0.99766023, -0.69108601, -0.01304244,  1.34933994],
       [ 2.01054631,  0.56645598,  0.26396167, -0.57839722],
       [ 0.11519785,  2.59733933, -0.08343361, -0.74179977],
       [-1.09203881, -2.29036177, -0.91997765, -0.73775907],
       [ 0.71675803, -0.38929943, -0.44192648,  1.27474593],
       [ 1.05805305,  0.86820536, -0.15228736,  1.20598808],
       [ 0.06172281, -0.44865367,  0.41847013, -0.90799013]])

This boolean array can be passed when indexing the array.
  
The Boolean array must be of the same length as the array axis it’s indexing. 

In [94]:
data[names == 'Bob']

array([[ 0.99766023, -0.69108601, -0.01304244,  1.34933994],
       [-1.09203881, -2.29036177, -0.91997765, -0.73775907]])
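
To illustrate the length requirement, the sketch below indexes a 7-row array with a mask of length 3. In recent NumPy versions this is expected to raise an `IndexError`. The array of zeros and the name `short_mask` are illustrative stand-ins, not the data used above.

```python
import numpy as np

rows = np.zeros((7, 4))                        # stand-in array with 7 rows
short_mask = np.array([True, False, True])     # length 3, but axis 0 has length 7

raised = False
try:
    rows[short_mask]
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```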

Select the rows where names == 'Bob' and index the columns as well.

In [95]:
data[names == 'Bob', 1]

array([-0.69108601, -2.29036177])

To select everything BUT 'Bob', use the != operator:

In [96]:
data[names != 'Bob']

array([[ 2.01054631,  0.56645598,  0.26396167, -0.57839722],
       [ 0.11519785,  2.59733933, -0.08343361, -0.74179977],
       [ 0.71675803, -0.38929943, -0.44192648,  1.27474593],
       [ 1.05805305,  0.86820536, -0.15228736,  1.20598808],
       [ 0.06172281, -0.44865367,  0.41847013, -0.90799013]])

Option II: To select everything BUT 'Bob', negate the condition with the ~ character (*'tilde'*)

In [97]:
data[~(names == 'Bob')]

array([[ 2.01054631,  0.56645598,  0.26396167, -0.57839722],
       [ 0.11519785,  2.59733933, -0.08343361, -0.74179977],
       [ 0.71675803, -0.38929943, -0.44192648,  1.27474593],
       [ 1.05805305,  0.86820536, -0.15228736,  1.20598808],
       [ 0.06172281, -0.44865367,  0.41847013, -0.90799013]])

To select two of the three names, combine multiple Boolean conditions with Boolean arithmetic operators like & (and) and | (or):

In [98]:
mask = (names == "Bob") | (names == "Will")
mask

array([ True, False,  True,  True,  True, False, False])

In [99]:
data[mask]

array([[ 0.99766023, -0.69108601, -0.01304244,  1.34933994],
       [ 0.11519785,  2.59733933, -0.08343361, -0.74179977],
       [-1.09203881, -2.29036177, -0.91997765, -0.73775907],
       [ 0.71675803, -0.38929943, -0.44192648,  1.27474593]])
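
The Python keywords `and` and `or` do not work element-wise on Boolean arrays. A minimal sketch of what happens when `or` is used instead of `|` (the names `keyword_failed` and `combined` are illustrative):

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])

keyword_failed = False
try:
    # `or` asks for the truth value of a whole array, which is ambiguous
    combined = (names == "Bob") or (names == "Will")
except ValueError as exc:
    keyword_failed = True
    print("ValueError:", exc)

# The element-wise operator works as intended
mask = (names == "Bob") | (names == "Will")
print(names[mask])  # ['Bob' 'Will' 'Bob' 'Will']
```

The parentheses around each comparison matter, because `|` and `&` bind more tightly than `==`.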