In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# NumPy
<!-- requirement: images/Sparse_vs_Dense_Matrices.svg -->
<!-- requirement: images/Numpy_Array_Vs_Python_List.svg -->


### Goals


 - What is NumPy?: `ndarray`, `matrix`, and some operations

In [2]:
# These are the standard "qualified" (as) imports
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns  # makes plots pretty

## NumPy


If you've ever used MATLAB, you know that writing out loops to add two vectors / find dot products / multiply two matrices / etc. is possible but _very_ slow.  Instead, you should use "vectorized" built-in operations that can do the looping in a faster language or even pick a better algorithm.

Python is similar.  _NumPy_ is the package that provides the means to do performant numerical calculations in Python.  If you've converted your problem into linear algebra and matrices, then _NumPy_ will let you write it to run fast.

**Why are Python lists unsuitable for numerical computation?**

There are two basic reasons why Python on its own is insufficient here:
  - _Data structure._  A Python list is a complicated thing. Just consider something like:
```        
x = [1, "23", BeautifulSoup(urlopen("https://www.google.com/#q=4")), 5]
``` 
    where x[0] and x[3] are numbers (of some sort), x[1] is a string, and x[2] is a complicated object.  If you're familiar with a low-level language like C, just imagine how this must be stored in memory: 
    
    > In CPython, the reference implementation, the list itself is a contiguous array of pointers, each pointing to a separate "Python object" structure on the heap.  Each such object records its reference count and its type, and more complex objects also carry a pointer to a dictionary (i.e. hash table) of instance attributes.  This is reasonable for x[2], but for x[0] and x[3]...
    
  - _Typing and dispatch._  When we write `x[0] + x[3]`, what happens?  You can overload `+` for all sorts of purposes in Python, and the decision of exactly what `+` means happens at run time, by looking up the operation on the operands' types.  If you were term-wise adding two lists, `x` and `y`, then because lists can contain elements of different types this has to happen _for each term_.
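
The contrast between the two storage models can be sketched as follows. This is a minimal illustration, not a benchmark: the names `py_list` and `arr` are made up here, and the size reported by `sys.getsizeof` depends on the platform and Python version, so no value is shown for it.

```python
import sys
import numpy as np

# A Python list may mix types; each element is a full Python object.
py_list = [1, "23", 5.0]
element_types = [type(item).__name__ for item in py_list]
print(element_types)  # ['int', 'str', 'float']

# Even a small integer carries object overhead (size is platform-dependent).
print(sys.getsizeof(1), "bytes for one Python int object")

# A NumPy array stores one fixed-size type in a contiguous block.
arr = np.array([1, 2, 3, 4], dtype=np.int64)
print(arr.dtype, arr.itemsize, arr.nbytes)  # int64 8 32
```

Because every element of `arr` has the same type, the meaning of `+` only needs to be decided once per array rather than once per element.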

**What NumPy does for us:**

The basic thing that NumPy does is avoid these two problems by using ordinary C-style arrays of integers, floating point numbers, etc., along with functions that operate on them intelligently. It also gives us C-style higher dimensional arrays.

Note that C-style arrays are good for more than just quickly performing operations through Python; they're also good for talking to existing C and Fortran code. This interoperability explains why NumPy matters to you even if you won't do any matrix computations by hand: many of the libraries that you _will_ want to use will use NumPy arrays under the hood.

![NumPy Array vs. Python List](images/Numpy_Array_Vs_Python_List.svg "Numpy Array vs. Python List")
<!-- Source: https://docs.google.com/drawings/d/1qsm90ZnesvtRr0_Y_hpJag5nragWv4fmkmfpBZRbxCQ/edit -->

## Data types (the nouns):


### `np.ndarray`

This is a C-style "n-D" array.  That is, it is just a big contiguous block of integers (or floats, or... but just one type per array) together with a factorization of its size into "dimensions":
  
  $$       N = n_1 n_2 \cdots n_d        $$
  
  In other words, the arrays that you might denote [1,2,3,4] and [[1,2],[3,4]] have the same underlying block of values, just with different dimensions: the first has shape `(4,)`, while the second has shape `(2, 2)`.
  
 For an alternate visual: Imagine a grid and numbering it by reading left to right -- next row -- left to right -- next row, etc.   For instance, in C the following bits of code are functionally equivalent
  
  >        
          int chessboard[64];
          //Do something
          chessboard[8*row + column] += 1;
  
  and
  
  >        
          int chessboard[8][8];
          //Do something
          chessboard[row][column] += 1;
          
  In this case the numbering (i.e. mapping to a single flat list of numbers) goes
  >        
          0  1  2  3  4  5  6  7
          8  9 10 11 ..
          16 ..
          ..
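
The same chessboard numbering can be reproduced in NumPy. The sketch below assumes the default row-major ("C") order; the variable names are illustrative only.

```python
import numpy as np

flat = np.arange(64)          # the underlying block: 0, 1, ..., 63
board = flat.reshape(8, 8)    # same values, viewed as 8 rows of 8

row, column = 2, 3
# Row-major (C-order) layout: element [row, column] sits at 8*row + column.
flat_index = 8 * row + column
print(board[row, column], flat[flat_index])  # 19 19

# NumPy can do the index arithmetic in both directions.
to_flat = np.ravel_multi_index((row, column), board.shape)
to_grid = tuple(int(i) for i in np.unravel_index(19, board.shape))
print(to_flat)  # 19
print(to_grid)  # (2, 3)
```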


In [3]:
X = np.empty((2,3))  # allocates memory but does not write to it (dangerous)
X

array([[0., 0., 0.],
       [0., 0., 0.]])

In [4]:
Y = np.zeros((2,3))  # array of all zeros
Y

array([[0., 0., 0.],
       [0., 0., 0.]])

In [5]:
X.shape

(2, 3)

In [6]:
X.shape == Y.shape

True

In [7]:
Z = np.ones((2,3))  # array of all ones
Z

array([[1., 1., 1.],
       [1., 1., 1.]])

In [8]:
Y.shape == Z.shape

True

In [9]:
Z2 = np.ones_like(Y)  # array of ones with the shape of Y

In [10]:
Z == Z2

array([[ True,  True,  True],
       [ True,  True,  True]])

In [11]:
np.all(Z == Z2)

True

In [12]:
np.all(Z - Z == Y)

True

### `np.matrix`


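The case of 2-D arrays, or "matrices," is given a special wrapper class with different operations.  These behave slightly differently from arrays, which we'll explain later.  Note that the NumPy documentation no longer recommends `np.matrix`, even for linear algebra; regular arrays are preferred in new code.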

In [13]:
x = np.matrix(range(5))
y = np.arange(5)
np.all(x == y)

True

In [14]:
type(x) is type(y)

False

In [15]:
isinstance(x,np.ndarray) # np.matrix is a subclass of np.ndarray

True

In [16]:
import numpy.matlib  # this import is oddly necessary for matrix code to work
matrix_I = np.matlib.identity(3)
array_I = np.eye(3)

np.all(matrix_I == array_I), type(matrix_I) is type(array_I)

(True, False)
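
One practical difference between the two types is the meaning of `*`. The following is a small sketch with made-up values, showing that `*` is term-by-term for an `ndarray` but a matrix product for an `np.matrix`.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])   # plain 2-D ndarray
M = np.matrix(A)                 # the same values wrapped as np.matrix

elementwise = A * A   # ndarray: term-by-term product
matmul = M * M        # np.matrix: true matrix product

print(elementwise)    # entries 1, 4, 9, 16
print(matmul)         # entries 7, 10, 15, 22

# For ndarrays, the @ operator (or .dot) gives the matrix product.
print(np.array_equal(A @ A, np.asarray(matmul)))  # True
```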

## Operations (the verbs):


Broadly, the things we can do are:
  - Create arrays.
  - Slicing or reshaping: Taking a sub-block of a block of values, or changing its dimensions.  Slicing, and in most cases reshaping, return a "view" (a "shallow copy"), because they do not actually copy the underlying block of data.
  - "Universal functions": This is NumPy's name for functions that are applied term-by-term, like the arithmetic operations or `sin`.
  - Linear algebra / matrix operations.
  - Mathematical convenience functions: FFT, etc.
  
Here's a table that shows some example syntax:
  
   Command  |  Explanation
   ---------|--------------
  `np.array(python_list, dtype='int')` | Convert a Python list to an `np.array`.  The `dtype` can be one of several things, such as 'int64', 'float32', 'float64', etc.
  `np.ndarray(shape=[1,2,3], buffer=an_np_array, dtype='int')`  | Makes a higher dimensional array whose underlying block of data is the given `np.array`.
  `np.arange(-5,5,1)` | Like Python's range, but slightly faster than `np.array(range(-5,5,1))`.
  `+`, `*`, `-`, `/`, `np.sin`, ... | All of the standard numerical and mathematical functions are back.  They always operate term-by-term.  That is, `x+y` is ordinary vector addition but `x*y` is term-wise product (not dot product).
   `np.dot(x,y)` or `x.dot(y)` | Inner product (along the last dimension, for n-D arrays).  Note that this includes matrix multiplication for 2-D arrays.
   `an_np_array.reshape([1,2,3])`  |  Reshape an `np.array` or `np.ndarray` to one with different shape (but of the same size).
   
All pretty simple!  Let's do a few quick examples.

In [17]:
x = np.arange(-5, 5, 1)  # NumPy will make intelligent guesses about your intended data type
x                        # And will convert between them if needed:

array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4])

In [18]:
y = np.sin(x)
y

array([ 0.95892427,  0.7568025 , -0.14112001, -0.90929743, -0.84147098,
        0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

In [19]:
x.dtype, y.dtype

(dtype('int32'), dtype('float64'))

In [20]:
np.linspace(-5, 5, 10)

array([-5.        , -3.88888889, -2.77777778, -1.66666667, -0.55555556,
        0.55555556,  1.66666667,  2.77777778,  3.88888889,  5.        ])

In [21]:
np.logspace(0, 4, 9)

array([1.00000000e+00, 3.16227766e+00, 1.00000000e+01, 3.16227766e+01,
       1.00000000e+02, 3.16227766e+02, 1.00000000e+03, 3.16227766e+03,
       1.00000000e+04])
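
The relationship between the two functions can be checked directly: `np.logspace(a, b, n)` gives powers of 10 whose exponents are `np.linspace(a, b, n)`. A short sketch, using the same arguments as the cell above:

```python
import numpy as np

exponents = np.linspace(0, 4, 9)   # evenly spaced exponents from 0 to 4
powers = np.logspace(0, 4, 9)      # 10 raised to each of those exponents

print(np.allclose(powers, 10.0 ** exponents))  # True

# Both include the endpoint by default, unlike np.arange.
print(len(exponents), len(powers))  # 9 9
```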

Basic math operations:

In [22]:
2 * x

array([-10,  -8,  -6,  -4,  -2,   0,   2,   4,   6,   8])

In [23]:
x * x

array([25, 16,  9,  4,  1,  0,  1,  4,  9, 16])

In [24]:
np.dot(x, x)

85

In [25]:
x.dot(x)

85

In [26]:
np.sqrt(x**2 + y**2)

array([5.09112323, 4.07096426, 3.00331731, 2.19700292, 1.30693283,
       0.        , 1.30693283, 2.19700292, 3.00331731, 4.07096426])

In [None]:
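# Inner product: (1 x n) . (n x 1) -> scalar (np.dot above)
# Outer product: (n x 1) x (1 x n) -> n x n matrix (np.outer below)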

In [27]:
np.outer(x, x)

array([[ 25,  20,  15,  10,   5,   0,  -5, -10, -15, -20],
       [ 20,  16,  12,   8,   4,   0,  -4,  -8, -12, -16],
       [ 15,  12,   9,   6,   3,   0,  -3,  -6,  -9, -12],
       [ 10,   8,   6,   4,   2,   0,  -2,  -4,  -6,  -8],
       [  5,   4,   3,   2,   1,   0,  -1,  -2,  -3,  -4],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4],
       [-10,  -8,  -6,  -4,  -2,   0,   2,   4,   6,   8],
       [-15, -12,  -9,  -6,  -3,   0,   3,   6,   9,  12],
       [-20, -16, -12,  -8,  -4,   0,   4,   8,  12,  16]])

In [28]:
np.outer(x, x).dot(x)

array([-425, -340, -255, -170,  -85,    0,   85,  170,  255,  340])

**Question:** How would you multiply two matrices?

You can do basic stats on arrays:

In [29]:
x.mean(), x.std()

(-0.5, 2.8722813232690143)
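
Two details of these methods are easy to miss: `std` divides by N unless told otherwise, and both methods accept an `axis` argument on multi-dimensional arrays. A minimal sketch, rebuilding `x` so the block is self-contained:

```python
import numpy as np

x = np.arange(-5, 5)

# Default is the population standard deviation (divide by N).
pop_std = x.std()
# ddof=1 gives the sample standard deviation (divide by N - 1).
sample_std = x.std(ddof=1)
print(pop_std < sample_std)  # True

# On a 2-D array, axis picks the dimension that is collapsed.
z = x.reshape(2, 5)
col_means = z.mean(axis=0)   # column means: -2.5, -1.5, -0.5, 0.5, 1.5
row_means = z.mean(axis=1)   # row means: -3.0, 2.0
print(col_means, row_means)
```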

You can reshape an array, as long as the number of elements remains the same.  Note that the result of a reshape is, whenever possible, a view onto the same data, so changing one will change the other.

In [30]:
z = x.reshape(2, 5)
z

array([[-5, -4, -3, -2, -1],
       [ 0,  1,  2,  3,  4]])

In [31]:
z[1,1] = 10
x

array([-5, -4, -3, -2, -1,  0, 10,  2,  3,  4])

If you need a deep copy instead of a view, use the `.copy()` method.

In [32]:
z = x.reshape(2,5).copy()
z[1,1] = 20
x

array([-5, -4, -3, -2, -1,  0, 10,  2,  3,  4])

In [33]:
x.reshape(5, -1)

array([[-5, -4],
       [-3, -2],
       [-1,  0],
       [10,  2],
       [ 3,  4]])

You can also create an array of arbitrary shape and type:

In [34]:
np.ndarray(shape=(2,3,3), dtype=int, buffer=np.arange(1,100))

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

### Matrix Multiplication and Transpose


1. Matrix multiplication, on 2-D arrays, is denoted by `A.dot(B)`.  Transpose is denoted by `A.T`.  Fair enough. 
2. If `A` is a 1-D array, then we often think of it as a vector.  This will usually give you the right "linear algebra notation" answer, e.g. if `M` is a 2-D array of the right size then
        A.dot(M)
        M.dot(A)
   represent the matrix products you think (where `A` is turned into a row or column vector as needed).
   
   If `A` is a 1-D array, then it is not the case that `A` is always a row (or column) vector -- NumPy interprets it as one or the other depending on the operation.  In particular, `A.T` is the same as `A` for a 1-D array.  For the matrix outer product $A A^T$ (with $A$ as a column vector) one must use `np.outer(A, A)` or _explicitly reshape_ `A` as a column vector, i.e. a 2-D array with just one column:

In [35]:
np.all(x.dot(x.T) == np.outer(x, x)) # x.T is just x for a 1-D array, so this is the inner product

False

In [36]:
v = x[:, np.newaxis] 
np.all(v.dot(v.T) == np.outer(x, x))

True

In [37]:
v = x.reshape(-1,1)
np.all(v.dot(v.T) == np.outer(x, x))

True

In [38]:
x.shape, v.shape

((10,), (10, 1))
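
The two points in the list above can be sketched with small made-up arrays. The `@` operator used for comparison is Python's matrix-multiplication operator, which for 2-D arrays computes the same product as `.dot`.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.arange(6).reshape(3, 2)   # shape (3, 2)

product = A.dot(B)               # shape (2, 2)
print(np.array_equal(product, A @ B))  # True: @ is the same operation
print(A.T.shape)                       # (3, 2)

# A 1-D array is treated as a row or column vector as the product requires.
v = np.array([1, 0, -1])
Av = A.dot(v)                    # A times v as a column: entries -2, -2
w = np.array([1, 1])
wA = w.dot(A)                    # w as a row times A: entries 5, 7, 9
print(Av, wA)
```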

### Performance of NumPy arrays vs. Python lists


For element-wise arithmetic on large sequences, NumPy arrays are much faster than Python lists.

In [39]:
xl = list(range(10000))  # materialize real Python lists, not lazy range objects
yl = list(range(10000))
xa = np.arange(10000)
ya = np.arange(10000)

In [40]:
%%timeit -n3

[i + j for i, j in zip(xl, yl)]

1.41 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [41]:
%%timeit -n3

xa + ya

The slowest run took 17.96 times longer than the fastest. This could mean that an intermediate result is being cached.
50.6 µs ± 64.8 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
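
Outside a notebook, where the `%%timeit` magic is not available, a similar comparison can be made with the standard-library `timeit` module. This is only a sketch: the repetition count of 100 is an arbitrary choice, and the absolute timings depend on the machine, so none are shown.

```python
import timeit
import numpy as np

n = 10_000
xl, yl = list(range(n)), list(range(n))   # genuine Python lists
xa, ya = np.arange(n), np.arange(n)       # NumPy arrays

list_sum = [i + j for i, j in zip(xl, yl)]
array_sum = xa + ya
print(np.array_equal(array_sum, list_sum))  # True: same result either way

# Time 100 repetitions of each; absolute numbers depend on the machine.
t_list = timeit.timeit(lambda: [i + j for i, j in zip(xl, yl)], number=100)
t_array = timeit.timeit(lambda: xa + ya, number=100)
print(f"list: {t_list:.4f} s, array: {t_array:.4f} s")
```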


### Array indexing


NumPy provides lots of ways to index into arrays.

In [42]:
x = np.arange(10)
y = x[2:5]  # like python list indexing
len(y)

3

In [43]:
# what's the difference?
x[2:3], x[2]

(array([2]), 2)

In [44]:
# select based on a condition
x[x % 2 == 1]

array([1, 3, 5, 7, 9])
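
These two kinds of indexing differ in one important way: a basic slice is a view onto the original data, while selection with a boolean mask produces a new array. A short sketch using `np.shares_memory` to check:

```python
import numpy as np

x = np.arange(10)

view = x[2:5]            # basic slice: a view onto x's data
masked = x[x % 2 == 1]   # boolean mask: a new array (copy)

print(np.shares_memory(x, view))    # True
print(np.shares_memory(x, masked))  # False

view[0] = 100    # writes through to x[2]
masked[0] = -1   # leaves x untouched
print(x[1], x[2])  # 1 100
```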

NumPy allows you to play fast and loose with indexing.  For example, assigning a scalar to a row fills the whole row:

In [None]:
A = np.empty((3,4))

# Constant assigned to entire row
for i in range(A.shape[0]):
    A[i] = 2. * i
    
print(A)

You can also pass in a list (or `np.array`) of indices:

In [None]:
X = 2. * np.arange(10)
X[range(9,-1,-1)]  # The reverse of the list

In [None]:
X[range(9,-1,-1)] = 3. * np.arange(10)   # Assign to the reverse of this list
X

### Conditional selection and max / min


The `np.where` construct allows you to take values from one array where a condition is true and values from another array where it is false.

In [None]:
x = np.arange(10)
np.where(x > 5, x, 0)

This can also be done manually, by taking advantage of the fact that `True` and `False` behave like `1` and `0` in math statements in Python.

In [None]:
(x > 5) * x + 0 * (x <= 5)

`np.maximum` and `np.minimum` can compare two arrays or an array to a constant.

In [None]:
x = np.arange(10)
y = np.arange(9, -1, -1)
print("The larger of x and y")
print(np.maximum(x, y))
print()
print("x capped at 5")
print(np.minimum(x, 5))
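
Chaining the two functions caps an array on both sides, which is what `np.clip` does in a single call. A minimal sketch, with the bounds 2 and 5 chosen arbitrarily:

```python
import numpy as np

x = np.arange(10)

# Cap below at 2 and above at 5 by chaining maximum and minimum...
chained = np.minimum(np.maximum(x, 2), 5)
# ...or in one call with np.clip.
clipped = np.clip(x, 2, 5)

print(clipped)  # [2 2 2 3 4 5 5 5 5 5]
print(np.array_equal(chained, clipped))  # True
```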

### Matrix Inversion


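While you can invert a 2-D NumPy array using `np.linalg.inv`, if your goal is to solve a linear system $Ax = b$ it is recommended that you use `np.linalg.solve`, which is faster and more numerically accurate than forming the inverse explicitly: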

In [None]:
A = np.array([[3,1], [1,2]])  # coefficient matrix
b = np.array([9,8])           # right-hand side
np.linalg.solve(A, b)         # solves A.dot(x) == b for x
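
To see that both routes give the same answer on this small system, the solution can be checked against the explicit inverse and against the original equations. The right-hand side is named `b` here for clarity.

```python
import numpy as np

A = np.array([[3, 1],
              [1, 2]])
b = np.array([9, 8])

x_solve = np.linalg.solve(A, b)    # solves A x = b directly
x_inv = np.linalg.inv(A).dot(b)    # forms the inverse first, then multiplies

print(np.allclose(x_solve, [2, 3]))    # True: 3*2 + 3 = 9 and 2 + 2*3 = 8
print(np.allclose(x_solve, x_inv))     # True
print(np.allclose(A.dot(x_solve), b))  # True
```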

### Singular Value Decomposition


Finally, you can compute the [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) of a 2-D NumPy array:

In [None]:
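A = np.arange(24).reshape(4,6)
U, s, Vh = np.linalg.svd(A)        # s holds the singular values; Vh is the (conjugate) transpose of V
diag = np.zeros((4,6))
diag[range(4), range(4)] = s       # embed the singular values in a 4x6 "diagonal" matrix
diff = U.dot(diag).dot(Vh) - A     # reconstruction error

np.abs(diff).max()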
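
An alternative that avoids building the rectangular diagonal matrix by hand is the reduced decomposition, requested with `full_matrices=False`. The following sketch uses the same 4x6 matrix:

```python
import numpy as np

A = np.arange(24).reshape(4, 6)

# full_matrices=False returns the reduced factors: U is (4, 4), Vh is (4, 6).
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)  # (4, 4) (4,) (4, 6)

# Scaling the columns of U by s replaces the explicit diagonal matrix.
reconstructed = (U * s) @ Vh
print(np.allclose(reconstructed, A))  # True

# Singular values come back sorted from largest to smallest.
print(bool(np.all(s[:-1] >= s[1:])))  # True
```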

## Persisting NumPy objects


You can save and restore NumPy objects using the `np.save` and `np.load` functions.  You can also save multiple NumPy objects in a single file using the `np.savez` function.

In [None]:
x = np.arange(10)
np.save('x.npy', x)
#del x
y = np.load('x.npy')
np.all(y == np.arange(10))
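
For several arrays at once, a minimal `np.savez` sketch might look like the following. The file name `arrays.npz` and the temporary directory are assumptions made so that the example cleans up after itself.

```python
import os
import tempfile
import numpy as np

x = np.arange(10)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays.npz")   # hypothetical file name
    np.savez(path, x=x, y=y)                     # keyword names become keys

    with np.load(path) as data:                  # closes the file on exit
        names = sorted(data.files)
        x_loaded = data["x"]
        y_loaded = data["y"]

print(names)  # ['x', 'y']
print(np.array_equal(x_loaded, x), np.array_equal(y_loaded, y))  # True True
```

Each array is stored under the keyword it was passed with, so it can be retrieved by name after loading.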