# NB: NumPy First Steps



## NumPy

**A new data structure**

Essentially, NumPy introduces a new data structure to Python -- the **n-dimensional array**. Along with it comes a collection of functions and methods that take advantage of this data structure.

The data structure is designed to support the use of **numerical methods**: algorithmic approximations to the problems of mathematical analysis.

**New Functions**

It also provides a new way of applying functions to data, made possible by the data structure -- **vectorized functions**. Vectorized functions replace the use of loops and comprehensions to apply a function to a set of data.
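
As a rough illustration of the difference, the sketch below squares the same values once with a list comprehension and once with a vectorized expression. The names `values`, `squared_loop`, and `squared_vec` are made up for this example.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Comprehension style: the operation is applied one element at a time.
squared_loop = [v ** 2 for v in values]

# Vectorized style: the operation is applied to the whole array at once.
squared_vec = np.array(values) ** 2

print(squared_loop)          # [1.0, 4.0, 9.0, 16.0]
print(squared_vec.tolist())  # [1.0, 4.0, 9.0, 16.0]
```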

In addition, given the data structure, it provides a library of **linear algebra** functions. 

**New Data Types**

NumPy also introduces a set of new **data types**, such as the fixed-size `int32` and `float64`.

**Python for Science**

Finally, because [numerical methods](https://www.britannica.com/science/numerical-analysis) are so important to so many sciences, NumPy is the basis of what is called **the scientific "stack"** in Python, which consists of SciPy, Matplotlib, scikit-learn, and Pandas. All of these assume that you have some knowledge of NumPy.

Let's take a look at it.

In [None]:
import numpy as np

NumPy is by widespread convention aliased as `np`.

## The ndarray

The ndarray is a multidimensional array object.

Let's explore it some. 

First, let's generate some fake data using NumPy's built-in random number generator.

In [None]:
##| jupyter: {outputs_hidden: false}
data = np.random.randn(2, 3)

In [None]:
##| jupyter: {outputs_hidden: false}
data

In [None]:
##| jupyter: {outputs_hidden: false}
data * 10

In [None]:
##| jupyter: {outputs_hidden: false}
data + data

In [None]:
##| jupyter: {outputs_hidden: false}
data.shape

In [None]:
##| jupyter: {outputs_hidden: false}
data.dtype

### About Dimensions

The term dimension is ambiguous.
* Sometimes it refers to the dimensions of things in the world, such as space and time.
* Sometimes it refers to the dimensions of a data structure, independent of what it represents in the world.

NumPy dimensions are the latter, although they can be used to represent the former, as physicists do.

The dimensions of data structures are sometimes called **axes**.

Consider this: Three-dimensional space can be represented as three columns in a two-dimensional table OR as three axes in a data cube. 
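
The two representations can be sketched side by side. This is only an illustration: the five points and the names `points` and `cube` are invented for the example.

```python
import numpy as np

# Five points in 3-D space stored as a 2-D table:
# one row per point, one column per spatial coordinate (x, y, z).
points = np.array([
    [0, 0, 0],
    [1, 0, 2],
    [1, 1, 1],
    [2, 0, 1],
    [2, 2, 2],
])
print(points.ndim, points.shape)  # 2 (5, 3)

# The same space stored as a data cube: one array axis per spatial
# coordinate, with a 1 marking each occupied cell.
cube = np.zeros((3, 3, 3), dtype=int)
cube[points[:, 0], points[:, 1], points[:, 2]] = 1
print(cube.ndim, cube.shape)      # 3 (3, 3, 3)
print(cube.sum())                 # 5
```

In the table, the data structure has two dimensions even though it describes three spatial ones; in the cube, the array axes and the spatial dimensions line up.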

## Creating ndarrays

In [None]:
##| jupyter: {outputs_hidden: false}
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

In [None]:
##| jupyter: {outputs_hidden: false}
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

In [None]:
##| jupyter: {outputs_hidden: false}
arr2.ndim

In [None]:
##| jupyter: {outputs_hidden: false}
arr2.shape

In [None]:
##| jupyter: {outputs_hidden: false}
arr1.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
arr2.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
np.zeros(10)

In [None]:
##| jupyter: {outputs_hidden: false}
np.zeros((3, 6))

In [None]:
##| jupyter: {outputs_hidden: false}
np.empty((2, 3, 2))

In [None]:
##| jupyter: {outputs_hidden: false}
np.arange(15)

### Data Types for ndarrays

**Unlike any of the previous data structures we have seen in Python, ndarrays must have a single data type associated with them.**
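
A small sketch of what this means in practice: a list keeps each element's own type, while an array built from the same values converts them all to one common type. The names `mixed_list` and `mixed_arr` are hypothetical.

```python
import numpy as np

mixed_list = [1, 2.5, True]  # a list can hold values of different types
print([type(v).__name__ for v in mixed_list])  # ['int', 'float', 'bool']

mixed_arr = np.array(mixed_list)  # NumPy picks one dtype for every element
print(mixed_arr.tolist())  # [1.0, 2.5, 1.0]
print(mixed_arr.dtype)     # float64
```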

In [None]:
##| jupyter: {outputs_hidden: false}
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr1.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
float_arr = arr.astype(np.float64)
float_arr.dtype

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr.astype(np.int32)

In [None]:
##| jupyter: {outputs_hidden: false}
# np.string_ was removed in NumPy 2.0; np.bytes_ is the same byte-string type
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings.astype(float)

In [None]:
##| jupyter: {outputs_hidden: false}
int_array = np.arange(10)
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)

In [None]:
##| tags: []
empty_uint32 = np.empty(8, dtype='u4')
empty_uint32

**NumPy Data Types**

```
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - byte string
U - unicode string
V - fixed chunk of memory for other type ( void )
```
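
These letter codes are what appear in short dtype strings such as `'u4'`: a letter for the kind, followed by a size (in bytes for most codes; for `U` it counts characters). The sketch below reads the code back from a few dtypes; the list `specs` is just a sample chosen for illustration.

```python
import numpy as np

# Hypothetical sample of dtype strings: a letter code plus a size.
specs = ["i4", "u4", "f8", "c16", "U5", "S5", "M8[D]", "m8[D]", "O"]

# Every dtype reports its one-letter code through the `kind` attribute.
kinds = {spec: np.dtype(spec).kind for spec in specs}
for spec, kind in kinds.items():
    print(f"{spec:>6} -> kind {kind!r}, {np.dtype(spec).itemsize} bytes per item")

print(np.dtype(bool).kind)  # b
```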

### Arithmetic

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr.shape

In [None]:
##| jupyter: {outputs_hidden: false}
arr * arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr - arr

In [None]:
##| jupyter: {outputs_hidden: false}
1 / arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr ** 0.5

In [None]:
##| jupyter: {outputs_hidden: false}
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

In [None]:
##| jupyter: {outputs_hidden: false}
arr2 > arr

## Basic Indexing and Slicing

In [None]:
foo = np.zeros((4,6))
foo.shape
foo[2:,:1].shape

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.arange(10)
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr[5]

In [None]:
##| jupyter: {outputs_hidden: false}
arr[5:8]

In [None]:
##| jupyter: {outputs_hidden: false}
arr[5:8] = 12

In [None]:
##| jupyter: {outputs_hidden: false}
arr

Notice that if we assign a scalar to a slice, all of the elements of the slice get that value. This is called **broadcasting**.
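
Broadcasting is not limited to scalars. As a minimal sketch (the names `grid` and `row` are invented here), a scalar can fill a two-dimensional slice, and a one-dimensional array can be combined with every row of a two-dimensional one:

```python
import numpy as np

grid = np.zeros((2, 3))
grid[:, 1:] = 5        # a scalar is stretched over a 2 x 2 slice
grid[0] = [1, 2, 3]    # a length-3 sequence fills one row
print(grid.tolist())   # [[1.0, 2.0, 3.0], [0.0, 5.0, 5.0]]

row = np.array([10, 20, 30])
# `row` is added to each row of `grid` in turn.
print((grid + row).tolist())  # [[11.0, 22.0, 33.0], [10.0, 25.0, 35.0]]
```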

Also, notice that changes to slices are changes to the arrays they are slices of. They are **views**, not copies. 

In [None]:
##| jupyter: {outputs_hidden: false}
arr_slice = arr[5:8]
arr_slice

In [None]:
##| jupyter: {outputs_hidden: false}
arr_slice[1] = 12345
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr_slice[:] = 64
arr

In [None]:
arr_slice

As NumPy has been designed with large data use cases in mind, you could imagine performance and memory problems if NumPy insisted on copying data left and right.

⭐ If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array; for example `arr[5:8].copy()`.
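
To check whether two arrays are a view and its parent or independent copies, `np.shares_memory` can be used. The sketch below uses hypothetical names (`base`, `view`, `copy`) so it does not disturb the arrays defined above.

```python
import numpy as np

base = np.arange(10)

view = base[5:8]         # basic slicing returns a view
copy = base[5:8].copy()  # an explicit, independent copy

print(np.shares_memory(base, view))  # True
print(np.shares_memory(base, copy))  # False

view[:] = -1  # writes through to `base`
copy[:] = 99  # leaves `base` untouched
print(base.tolist())  # [0, 1, 2, 3, 4, -1, -1, -1, 8, 9]
```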

**Higher Dimensional Arrays**

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[2]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[0][2]

**Simplified notation**

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[0, 2]

A nice visual of a 2D array

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172112.png" height="50%" width="50%"/>

**Two-Dimensional Array Slicing**

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172114.png" height="50%" width="50%"/>

**3D arrays**

In [None]:
##| jupyter: {outputs_hidden: false}
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

In [None]:
arr3d.shape

In [None]:
##| jupyter: {outputs_hidden: false}
arr3d

I find NumPy's way of showing the data a bit difficult to parse visually.

💡 **Here is a way to visualize 3 and higher dimensional data:**

```python
[ # AXIS 0                     CONTAINS 2 ELEMENTS (arrays)
    [ # AXIS 1                 CONTAINS 2 ELEMENTS (arrays)
        [1, 2, 3], # AXIS 2    CONTAINS 3 ELEMENTS (integers)
        [4, 5, 6]  # AXIS 2
    ],  
    [ # AXIS 1
        [7, 8, 9], 
        [10, 11, 12]
    ]
]
```
Each axis is a level in the nested hierarchy, i.e. a tree or DAG (directed-acyclic graph).

* Each axis is a container.
* There is only one top container.
* Only the bottom containers have data.
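
The container picture can be checked by stepping down one level at a time and counting the elements at each level. This is a sketch with made-up names (`cube`, `level`, `lengths`); the array has the same contents as `arr3d`.

```python
import numpy as np

cube = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

lengths = []
level = cube
for axis in range(cube.ndim):
    lengths.append(len(level))  # number of elements in this container
    print(f"axis {axis}: {len(level)} elements")
    level = level[0]            # step into the first element one level down

print(lengths)     # [2, 2, 3]
print(cube.shape)  # (2, 2, 3)
```

The lengths collected along the way are exactly the entries of `shape`.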

**Omit lower indices**

In multidimensional arrays, if you omit later indices, the returned object will be a **lower-dimensional ndarray** consisting of all the data along the remaining dimensions.

So in the 2 × 2 × 3 array `arr3d`:

In [None]:
##| jupyter: {outputs_hidden: false}
arr3d[0]

Saving data before modifying an array.

In [None]:
##| jupyter: {outputs_hidden: false}
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

Putting the data back.

In [None]:
##| jupyter: {outputs_hidden: false}
arr3d[0] = old_values
arr3d

Similarly, `arr3d[1, 0]` gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:

In [None]:
##| jupyter: {outputs_hidden: false}
arr3d[1, 0]

In [None]:
##| jupyter: {outputs_hidden: false}
x = arr3d[1]
x

In [None]:
##| jupyter: {outputs_hidden: false}
x[0]

### Indexing with slices

In [None]:
##| jupyter: {outputs_hidden: false}
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr[1:6]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[:2]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[:2, 1:]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[1, :2]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[:2, 2]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[:, :1]

In [None]:
##| jupyter: {outputs_hidden: false}
arr2d[:2, 1:] = 0
arr2d

### Boolean Indexing

This is a crucial topic -- it also applies to Pandas and R.

You can pass a boolean array (a mask) to the array indexer (i.e. `[]`) and it will return only those cells where the mask is `True`.

This is like replacing `if` statements with a `dict` lookup.
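
As a small sketch of the idea, the filter below is written first with an explicit `if` test per element and then with a boolean mask. The data and the names `scores`, `passed_loop`, and `passed_mask` are invented for the example.

```python
import numpy as np

scores = np.array([72, 45, 90, 58, 83])

# Comprehension with an explicit `if` test on each element.
passed_loop = [int(s) for s in scores if s >= 60]

# Boolean indexing: build the mask once, then select with it.
mask = scores >= 60
passed_mask = scores[mask]

print(mask.tolist())         # [True, False, True, False, True]
print(passed_mask.tolist())  # [72, 90, 83]
```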

Let's assume that we have two related arrays:
* `names`, which labels the rows (observations) of a table
* `data`, which holds the feature values for each observation

There are $7$ observations and $4$ features.

In [None]:
##| jupyter: {outputs_hidden: false}
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names

In [None]:
##| jupyter: {outputs_hidden: false}
data = np.random.randn(7, 4)
data

In [None]:
names.shape, data.shape

A comparison operation in an array returns an array of booleans.

In [None]:
##| jupyter: {outputs_hidden: false}
names == 'Bob'

This array can be passed to an array indexer:

In [None]:
##| jupyter: {outputs_hidden: false}
data[names == 'Bob']

Along the second axis, we can use a slice to select data.

In [None]:
##| jupyter: {outputs_hidden: false}
data[names == 'Bob', 2:]

In [None]:
##| jupyter: {outputs_hidden: false}
data[names == 'Bob', 3]

If you know SQL, `data[names == 'Bob', 2:]` is like the query:

```sql
SELECT col3, col4 FROM data WHERE name = 'Bob'
```

Here are some examples of boolean operations being applied.

Note that we don't use `not` but instead the tilde `~` sign to negate (flip) a value.

In [None]:
##| jupyter: {outputs_hidden: false}
names != 'Bob'
data[~(names == 'Bob')]

In [None]:
##| jupyter: {outputs_hidden: false}
cond = names == 'Bob'
data[~cond]

Similarly, we don't use `and` and `or` but `&` and `|`.

Also, expressions joined by these operators need to be in parentheses.
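
The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `==`. The sketch below shows a working elementwise AND and what happens with Python's `and`; the arrays `people` and `ages` are hypothetical.

```python
import numpy as np

people = np.array(["Bob", "Joe", "Will", "Bob"])
ages = np.array([31, 45, 28, 52])

# Elementwise AND: each comparison is wrapped in parentheses.
both = (people == "Bob") & (ages > 40)
print(both.tolist())  # [False, False, False, True]

# Python's `and` asks for a single truth value for a whole array,
# which NumPy refuses to guess for arrays with more than one element.
try:
    (people == "Bob") and (ages > 40)
except ValueError as err:
    print("ValueError:", err)
```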

In [None]:
##| jupyter: {outputs_hidden: false}
mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]

In [None]:
##| jupyter: {outputs_hidden: false}
data[data < 0] = 0
data

In [None]:
##| jupyter: {outputs_hidden: false}
data[names != 'Joe'] = 7
data

### Fancy Indexing

In so-called fancy indexing, we use array index numbers to access data.

We pass a `list` of item numbers, instead of an integer or integer range with `:`, to the indexer.

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

This says: _Select rows 4, 3, 0, and 6, in that order._

In [None]:
##| jupyter: {outputs_hidden: false}
arr[[4, 3, 0, 6]]

Negative indices select rows counting from the end.

In [None]:
##| jupyter: {outputs_hidden: false}
arr[[-3, -5, -7]]

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.arange(32).reshape((8, 4))
arr

We can drop these lists into each axis selector.

In [None]:
##| jupyter: {outputs_hidden: false}
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

In [None]:
##| jupyter: {outputs_hidden: false}
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
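
The two expressions above return different things, which the following sketch spells out. It rebuilds the same 8 × 4 array under the hypothetical name `table`; `np.ix_` is shown as one alternative way to get the rectangular block.

```python
import numpy as np

table = np.arange(32).reshape((8, 4))
rows, cols = [1, 5, 7, 2], [0, 3, 1, 2]

# Index lists on two axes are paired up: (1, 0), (5, 3), (7, 1), (2, 2).
paired = table[rows, cols]
print(paired.tolist())  # [4, 23, 29, 10]

# Indexing in two steps selects the full rows-by-columns block instead.
block = table[rows][:, cols]
print(block.shape)      # (4, 4)

# np.ix_ builds the same block in a single indexing step.
print(np.array_equal(block, table[np.ix_(rows, cols)]))  # True
```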

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping which similarly returns a view on the underlying data without copying anything. 

Arrays have the `transpose` method and also the special `T` attribute:

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.arange(15).reshape((3, 5))
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr.T

Transposing is often used when computing the dot product between two arrays.

Here's an example.

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.random.randn(6, 3)
arr

In [None]:
##| jupyter: {outputs_hidden: false}
np.dot(arr.T, arr)
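
The transpose is what makes the shapes line up: a (3, 6) array times a (6, 3) array gives a (3, 3) result. A minimal sketch with a deterministic array, using the hypothetical names `X` and `gram`:

```python
import numpy as np

X = np.arange(18, dtype=float).reshape((6, 3))

gram = np.dot(X.T, X)  # (3, 6) times (6, 3) -> (3, 3)
print(gram.shape)      # (3, 3)

# The @ operator computes the same matrix product.
print(np.allclose(gram, X.T @ X))  # True

# The product of a matrix's transpose with the matrix is symmetric.
print(np.allclose(gram, gram.T))   # True
```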

For higher dimensional arrays, `transpose` will accept a tuple of axis numbers to permute the axes.

Warning -- this can get confusing to conceptualize and visualize!
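
One way to keep track of a permutation is to compare shapes and follow a single element. The sketch below uses a 2 × 3 × 4 array (the names `cube3` and `permuted` are made up) so that each axis has a different length and is easy to spot.

```python
import numpy as np

cube3 = np.arange(24).reshape((2, 3, 4))

# New axis 0 is old axis 1, new axis 1 is old axis 0, axis 2 stays put.
permuted = cube3.transpose((1, 0, 2))
print(cube3.shape, "->", permuted.shape)  # (2, 3, 4) -> (3, 2, 4)

# An element at [i, j, k] moves to [j, i, k].
print(cube3[1, 2, 3] == permuted[2, 1, 3])  # True

# The result is a view on the same data, not a copy.
print(np.shares_memory(cube3, permuted))    # True
```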

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.arange(16).reshape((2, 2, 4))
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr.transpose((1, 0, 2))

Simple transposing with `.T` is just a special case of swapping axes. ndarray has the method `swapaxes` which takes a pair of axis numbers:
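
Both claims can be checked directly. In this sketch (with the hypothetical names `matrix` and `block3`), `.T` on a two-dimensional array matches `swapaxes(0, 1)`, and swapping two axes matches the corresponding `transpose` permutation.

```python
import numpy as np

matrix = np.arange(15).reshape((3, 5))
print(np.array_equal(matrix.T, matrix.swapaxes(0, 1)))  # True

block3 = np.arange(16).reshape((2, 2, 4))
swapped = block3.swapaxes(1, 2)
print(swapped.shape)  # (2, 4, 2)

# Swapping axes 1 and 2 is the same as the permutation (0, 2, 1).
print(np.array_equal(swapped, block3.transpose((0, 2, 1))))  # True
```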

In [None]:
##| jupyter: {outputs_hidden: false}
arr

In [None]:
##| jupyter: {outputs_hidden: false}
arr.swapaxes(1, 2)

## Universal Functions

A universal function, or `ufunc`, is a function that performs elementwise operations on data in ndarrays. You can think of them as **fast vectorized wrappers for simple functions** that take one or more scalar values and produce one or more scalar results.
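
Each ufunc records how many inputs it takes and how many outputs it produces in its `nin` and `nout` attributes. A quick sketch, collecting them in a dictionary with the made-up name `signature`:

```python
import numpy as np

print(isinstance(np.sqrt, np.ufunc))  # True

# (number of inputs, number of outputs) for a few ufuncs
signature = {
    "sqrt": (np.sqrt.nin, np.sqrt.nout),           # unary: one in, one out
    "maximum": (np.maximum.nin, np.maximum.nout),  # binary: two in, one out
    "modf": (np.modf.nin, np.modf.nout),           # one in, two result arrays
}
print(signature)  # {'sqrt': (1, 1), 'maximum': (2, 1), 'modf': (1, 2)}
```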

Many `ufuncs` are simple elementwise transformations, like `sqrt` or `exp`:

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.arange(10)
arr

In [None]:
##| jupyter: {outputs_hidden: false}
np.sqrt(arr)

In [None]:
##| jupyter: {outputs_hidden: false}
np.exp(arr)

In [None]:
##| jupyter: {outputs_hidden: false}
x = np.random.randn(8)
x

In [None]:
##| jupyter: {outputs_hidden: false}
y = np.random.randn(8)
y

In [None]:
##| jupyter: {outputs_hidden: false}
np.maximum(x, y)

In [None]:
##| jupyter: {outputs_hidden: false}
arr = np.random.randn(7) * 5
arr

In [None]:
##| jupyter: {outputs_hidden: false}
remainder, whole_part = np.modf(arr)
remainder

In [None]:
##| jupyter: {outputs_hidden: false}
whole_part

In [None]:
##| jupyter: {outputs_hidden: false}
arr

In [None]:
##| jupyter: {outputs_hidden: false}
np.sqrt(arr)

In [None]:
##| jupyter: {outputs_hidden: false}
np.sqrt(arr, arr)

In [None]:
##| jupyter: {outputs_hidden: false}
arr

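`nan` ("not a number") is a special floating-point value in NumPy. Here it appears wherever the square root of a negative number was taken.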
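
A short sketch of how `nan` behaves, using a small hand-made array (`values` and `roots` are hypothetical names) and `np.errstate` to silence the warning that `np.sqrt` would otherwise emit:

```python
import numpy as np

values = np.array([4.0, -1.0, 9.0])
with np.errstate(invalid="ignore"):  # suppress the RuntimeWarning
    roots = np.sqrt(values)

print(roots.tolist())            # [2.0, nan, 3.0]
print(np.isnan(roots).tolist())  # [False, True, False]
print(np.nan == np.nan)          # False: nan is not equal to itself
print(np.nansum(roots))          # 5.0 (nan entries are skipped)
```

Because `nan` never compares equal to anything, `np.isnan` is the reliable way to find it.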
