# Numpy
Numpy is the array package of Python. It isn't built in, but it is very well integrated into the Python ecosystem.

Numpy isn't matrices.  It isn't vectors.  It is `ndarray`s - n-dimensional arrays, so like most things in Python it is very flexible and can be used for almost anything.

An `ndarray` is not a matrix, but there are separate classes for that.

Numpy arrays are good for basic arithmetic and linear algebra.  For a more usable version for typical data analysis, we will look at the `pandas` data structures next.

## Base numpy array properties
* N-dimensional
* Uniform data type (dtype)
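
The two properties above can be inspected directly on any array. The following is a minimal sketch; the variable names are only illustrative.

```python
import numpy as np

x = np.zeros((2, 3), dtype=np.int64)

n_axes = x.ndim          # number of dimensions (axes)
shape = x.shape          # length along each axis
element_type = x.dtype   # the single dtype shared by every element

# Mixing ints and floats still yields one uniform dtype (upcast to float64)
mixed = np.array([1, 2.5])
print(n_axes, shape, element_type, mixed.dtype)
```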

## Creating arrays
There are many ways to create arrays, most importantly...
* `np.array`
* `np.asarray`
* Creator functions that return new data: `np.zeros`, `np.ones`, `np.arange`, `np.random.*`, etc.
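
The difference between the first two functions is easy to miss. As a small sketch: `np.array` copies its input by default, while `np.asarray` returns the input unchanged when it is already an ndarray of a suitable type.

```python
import numpy as np

original = np.arange(5)
copied = np.array(original)     # np.array makes a copy by default
same = np.asarray(original)     # np.asarray passes an existing ndarray through

original[0] = 99                # modify the original afterwards
print(copied[0], same[0])       # the copy is unaffected, the pass-through is not
print(same is original)
```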

In [2]:
import numpy as np

In [3]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [7]:
np.ones((3,4,2,2,5))

array([[[[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]]],



       [[[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]],

         [[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]]],


        [[[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1

In [9]:
a = np.arange(12)
a.shape = (3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [10]:
a.shape = (3,2,2)
a

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])
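
Assigning to `a.shape` changes the array in place. An alternative that leaves the original untouched is the `reshape` method, sketched here under the assumption that you want to keep the 1-D array around:

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)     # new array object; a keeps its 1-D shape
c = a.reshape(2, -1)    # -1 lets numpy infer the remaining axis length

print(a.shape, b.shape, c.shape)
print(np.shares_memory(a, b))   # the reshaped array still uses a's memory
```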

In [19]:
np.array([1,2,3,4,5,6])

array([1, 2, 3, 4, 5, 6])

In [20]:
np.array([[9,8,7],[6,5,4],[3,2,1]])

array([[9, 8, 7],
       [6, 5, 4],
       [3, 2, 1]])

Different data types can be specified with `dtype=`.  This includes arbitrary C data types (refer to the docs to learn how to specify them).

In [22]:
np.array([[9,8,7],[6,5,4],[3,2,1]], dtype=float)

array([[9., 8., 7.],
       [6., 5., 4.],
       [3., 2., 1.]])

In [30]:
a = np.array([[9,8,7],[6,5,4],[3,2,1]], dtype=np.float32)
#a.dtype = np.int32
#a
np.array(a, dtype=np.complex128)

array([[9.+0.j, 8.+0.j, 7.+0.j],
       [6.+0.j, 5.+0.j, 4.+0.j],
       [3.+0.j, 2.+0.j, 1.+0.j]])
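
The commented-out `a.dtype = np.int32` line above hints at a distinction worth spelling out: converting values and reinterpreting bytes are different operations. A small sketch of both, using `astype` and `view`:

```python
import numpy as np

f = np.array([1.7, -2.2, 3.9], dtype=np.float32)

# astype converts each value and returns a new array (floats are truncated)
converted = f.astype(np.int32)

# view reinterprets the same bytes as another dtype: no conversion, no copy
reinterpreted = f.view(np.int32)

print(converted)
print(np.shares_memory(f, converted), np.shares_memory(f, reinterpreted))
```

The reinterpreted values are the raw bit patterns of the floats, which is rarely what you want for ordinary type conversion.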

# Slicing and views
* Numpy allows you to **slice** arrays, which means extracting subsets along one or more axes.
* You can do this on any axis equivalently, and it is very powerful.
* Numpy by default tries to avoid copies: a slice is a **view** that points to the same physical memory as the original array.  Modifying a view modifies every array that shares that memory.
* It can take some time to fully understand it, especially the N-dimensional nature of arrays!

In [31]:
a = np.arange(24)
a.shape = 6, 4
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [32]:
# Slice the first axis
a[0]

array([0, 1, 2, 3])

In [35]:
a[1,0]

4

In [37]:
# Slice the second axis.  How do you tell the axes apart?
a[:,0]

array([ 0,  4,  8, 12, 16, 20])

In [39]:
# Remember the Python slicing syntax, start:stop:step?  The same applies here
# Example with Python lists:
l = list(range(10))
print(l)
print(l[5:])
print(l[:3])
print(l[1::2])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5, 6, 7, 8, 9]
[0, 1, 2]
[1, 3, 5, 7, 9]


In [41]:
# With arrays, it is the same:
a[:2]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [42]:
a[:,:2]

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13],
       [16, 17],
       [20, 21]])

In [43]:
a[1:3,1:3]

array([[ 5,  6],
       [ 9, 10]])

Remember we said that slices are views into the same data?  Having full control of copying is critical to any kind of efficient programming!

In [None]:
a[0,0] = 5

In [48]:
a2 = a.copy() # Force a copy.  np.array also does this.
a3 = a2[1:3]  # a3 is a view of a2
a3[:] = 0     # edit a3
a2            # a2 changes, too!

array([[ 0,  1,  2,  3],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
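
If you are unsure whether two arrays are a view and its source or independent copies, you can ask numpy. This is a small sketch using `np.shares_memory`; the names `view` and `independent` are only illustrative.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)
view = a[1:3]                  # basic slicing: a view
independent = a[1:3].copy()    # explicit copy

print(np.shares_memory(a, view))          # view and a use the same memory
print(np.shares_memory(a, independent))   # the copy does not

view[:] = 0                    # writes through to a, but not to the copy
print(a[1], independent[0])
```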

## Advanced indexing
### Boolean indexing
You can do advanced slicing too, using boolean arrays.  A comparison operation on an array gives you a boolean array of the same shape.  If you use this array as an index, it acts as a mask: only the elements where it is `True` are selected.

This is used a lot for selecting data by condition.

(Boolean indexing isn't the only method of advanced indexing.)

Advanced: how does this relate to memory management?

In [58]:
# Only one True value
a == 1

array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [61]:
a[a==1]

array([1])

In [62]:
# Note that dimension gets reduced
a[a > 5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23])

In [63]:
# Every row, where the first element is a multiple of 3
a[a[:,0]%3 == 0]

array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

In [64]:
# Do the same for columns.  Now we start getting to a "trial and error" point!
columns_to_select = a[0]%3 == 0
a[:, columns_to_select]

array([[ 0,  3],
       [ 4,  7],
       [ 8, 11],
       [12, 15],
       [16, 19],
       [20, 23]])
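
On the memory management question raised above: unlike basic slices, the result of boolean indexing is a new array, not a view. Assigning *through* a mask, however, does modify the original. A short sketch:

```python
import numpy as np

a = np.arange(24).reshape(6, 4)
mask = a > 20

selected = a[mask]       # boolean indexing returns a copy
selected[:] = -1         # so this does not touch a
unchanged_row = a[5].copy()

a[mask] = 0              # assignment through the mask changes a in place
print(unchanged_row, a[5], selected)
```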

# Numpy documentation tour

Numpy has good documentation: 
* User guide: https://docs.scipy.org/doc/numpy/user/index.html
* Reference guide: https://docs.scipy.org/doc/numpy/reference/index.html

One thing that is often left out is "how to read documentation", so take a brief tour of these.

Being able to use the reference guides is very valuable...

# Exercise 03.1

* 1) Create array of size 5x6 with sequential values (`np.arange`)

In [100]:
b = np.arange(30)
b.shape = 5, 6
b

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

* 2) How do you transpose the array?

In [90]:
b.T

array([[ 0,  6, 12, 18, 24],
       [ 1,  7, 13, 19, 25],
       [ 2,  8, 14, 20, 26],
       [ 3,  9, 15, 21, 27],
       [ 4, 10, 16, 22, 28],
       [ 5, 11, 17, 23, 29]])

* 3) Slice the top and bottom rows away.

In [91]:
b[1:-1]

array([[ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

* 4) Replace every other element with 0.  Replace every third element with 1.

In [101]:
b[:,::2] = 0
b.shape = 30,
b[::3] = 1
b.shape = 5,6
b

array([[ 1,  1,  0,  1,  0,  5],
       [ 1,  7,  0,  1,  0, 11],
       [ 1, 13,  0,  1,  0, 17],
       [ 1, 19,  0,  1,  0, 23],
       [ 1, 25,  0,  1,  0, 29]])

* 5) Create matrices and try multiplying them.  Create the same arrays and try multiplying.  What's the difference? (hint: `np.matrix`)

In [108]:
a = np.array([[0,1],[2,3]])
m = np.matrix([[0,1],[2,3]])
print(a*a)   # ndarray: * is the elementwise product
print(m@m)   # matrix product; for np.matrix, m*m gives the same result

[[0 1]
 [4 9]]
[[ 2  3]
 [ 6 11]]


* 6) Bonus: try to understand the axis ordering system.  Can you find some logic to first and last axes, and how things are displayed?

* 7) Bonus: learn about advanced indexing and the `where` function.  Extract only multiples of three from your array.
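
As a starting point for the last bonus item, here is one possible sketch (not the only solution) showing a boolean mask next to the two common forms of `np.where`. It assumes a fresh 5x6 array like the one from step 1.

```python
import numpy as np

b = np.arange(30).reshape(5, 6)
is_multiple = b % 3 == 0

multiples = b[is_multiple]                 # 1-D array of the selected values
rows, cols = np.where(is_multiple)         # one-argument form: index arrays
replaced = np.where(is_multiple, b, -1)    # three-argument form: keeps the shape

print(multiples)
print(replaced[0])
```

The one-argument form tells you *where* the matches are, while the three-argument form builds a same-shaped array by choosing between two alternatives.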

# Basic array operations
* Of course, once you have arrays, you want to use them.

* The most basic type of function is the `ufunc`: "universal function"
  * They take arrays or scalars as input
  * They operate elementwise
  * They are optimized functions written in C and are generally clever about how they do things

* What's the benefit of ufuncs?  They are universal: the same function works on any data type and any number of dimensions.

* They engage in something called **broadcasting**: if the shapes don't match, axes of length 1 (and missing leading axes) are stretched to match the other operand.

* They exist as `np.add`, etc., but normal operators such as `+` work too, where they exist.

* These functions provide **vectorization**: one operation operates on a lot of data at once.
  * If you ever start doing `for` loops, think: can I use vectorized operations?  A lot of the time, the answer is yes.
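
To make the vectorization point concrete, here is a small sketch computing the same thing with a `for` loop and with a single vectorized expression. The formula `2*x + 1` is an arbitrary choice for illustration.

```python
import numpy as np

values = np.arange(1000, dtype=float)

# Loop version: one Python-level operation per element
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2 + 1

# Vectorized version: the loop runs inside numpy's compiled code
vectorized = values * 2 + 1

print(np.array_equal(looped, vectorized))
```

Timing both versions with `%%timeit` in a notebook is a quick way to see the difference on your own machine.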

In [65]:
a = np.arange(24)
a.shape = 6, 4
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [None]:
a + 10

In [None]:
np.add(a, 10)    # explicit function call - relevant for more advanced functions

It's obvious that this adds 10.  What about something else?

In [71]:
np.add(a, [10,20,30,40])

array([[10, 21, 32, 43],
       [14, 25, 36, 47],
       [18, 29, 40, 51],
       [22, 33, 44, 55],
       [26, 37, 48, 59],
       [30, 41, 52, 63]])

In [72]:
print(np.array([[10],[20],[30],[40],[50],[60]]))
np.add(a, [[10],[20],[30],[40],[50],[60]])

[[10]
 [20]
 [30]
 [40]
 [50]
 [60]]


array([[10, 11, 12, 13],
       [24, 25, 26, 27],
       [38, 39, 40, 41],
       [52, 53, 54, 55],
       [66, 67, 68, 69],
       [80, 81, 82, 83]])
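
The two examples above follow from how broadcasting aligns shapes: it starts from the last axis, and an axis of length 1 (or a missing leading axis) is stretched. A sketch that checks the resulting shapes without doing the arithmetic, and shows a case that fails:

```python
import numpy as np

a = np.arange(24).reshape(6, 4)
row = np.array([10, 20, 30, 40])      # shape (4,), treated as (1, 4)
col = np.arange(6)[:, np.newaxis]     # shape (6, 1)

print(np.broadcast(a, row).shape)
print(np.broadcast(a, col).shape)

# A length-6 1-D array is aligned with the LAST axis (length 4), so it fails
try:
    a + np.arange(6)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```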

Ufuncs can take an output argument.  The result is written into an existing array instead of a newly allocated one, which avoids extra copying of data!  This matters most when an operation would otherwise create many temporary copies.

In [73]:
b = a.copy()             # output array
np.multiply(a, a, out=b) # write a*a into b instead of allocating a new array
print(b)

[[  0   1   4   9]
 [ 16  25  36  49]
 [ 64  81 100 121]
 [144 169 196 225]
 [256 289 324 361]
 [400 441 484 529]]


In [77]:
print(id(b))
b += a
b
print(id(b))

47020486928224
47020486928224


In [80]:
print(id(b))
b = np.multiply(b, 2)
print(id(b))

47020486928224
47020486960112


Note about operations:

* Numpy, by default, always uses element-by-element operations.  For example, try to multiply two 2-D arrays together with `*`: you get the elementwise product, whereas Matlab by default does matrix multiplication.
* Why is this?  Consistency: all operations work the same no matter what you have.  Numpy is designed so that if you have a function built for normal Python objects, it will work for arrays as well, giving you automatic vectorization.  If you have a function that works for arrays, it will work for arrays of higher dimensions too via broadcasting.
* Explicit matrix multiplication operator: `a @ b`
* See also the `matrix` class and modules.
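
With plain arrays, the two kinds of product look like this. It is a minimal sketch using the same 2x2 values as the earlier exercise:

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])

elementwise = a * a        # each element multiplied by itself
matrix_product = a @ a     # rows times columns, same as np.matmul(a, a)

print(elementwise)
print(matrix_product)
```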

## Axiswise ufuncs
Reductions built on ufuncs, such as `np.sum`, can be applied along a single axis with the `axis` argument!

In [81]:
np.sum(a, axis=0)

array([60, 66, 72, 78])

In [82]:
np.sum(a, axis=1)

array([ 6, 22, 38, 54, 70, 86])

In [83]:
np.sum(a)

276

Other functions that accept an `axis` argument: `prod`, `diff`, `nansum`, etc.
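
Two details that often help with axiswise operations, shown as a sketch: a ufunc's own `reduce` method gives the same column sums as `np.sum`, and `keepdims=True` keeps the reduced axis with length 1 so the result broadcasts back against the original array.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)

col_sums = np.add.reduce(a, axis=0)        # same result as np.sum(a, axis=0)
row_sums = a.sum(axis=1, keepdims=True)    # shape (6, 1) instead of (6,)

# Because row_sums is (6, 1), it broadcasts across the 4 columns
fractions = a / row_sums
print(col_sums, row_sums.shape)
print(fractions.sum(axis=1))
```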

# Reading data: numpy limits
Here, we read a particular dataset called the "iris" dataset.  We don't do anything practical here, but use it as an example to show what works and doesn't.

In [138]:
# What's in the dataset: we see five columns, the first four are numbers but the last is text
open("../data/iris.data").readlines()[:3]

['5.1,3.5,1.4,0.2,Iris-setosa\n',
 '4.9,3.0,1.4,0.2,Iris-setosa\n',
 '4.7,3.2,1.3,0.2,Iris-setosa\n']

In [139]:
# Load with the default dtype (float): the text column cannot be parsed and becomes nan
iris = np.genfromtxt('../data/iris.data', delimiter=',')
iris

array([[5.1, 3.5, 1.4, 0.2, nan],
       [4.9, 3. , 1.4, 0.2, nan],
       [4.7, 3.2, 1.3, 0.2, nan],
       [4.6, 3.1, 1.5, 0.2, nan],
       [5. , 3.6, 1.4, 0.2, nan],
       [5.4, 3.9, 1.7, 0.4, nan],
       [4.6, 3.4, 1.4, 0.3, nan],
       [5. , 3.4, 1.5, 0.2, nan],
       [4.4, 2.9, 1.4, 0.2, nan],
       [4.9, 3.1, 1.5, 0.1, nan],
       [5.4, 3.7, 1.5, 0.2, nan],
       [4.8, 3.4, 1.6, 0.2, nan],
       [4.8, 3. , 1.4, 0.1, nan],
       [4.3, 3. , 1.1, 0.1, nan],
       [5.8, 4. , 1.2, 0.2, nan],
       [5.7, 4.4, 1.5, 0.4, nan],
       [5.4, 3.9, 1.3, 0.4, nan],
       [5.1, 3.5, 1.4, 0.3, nan],
       [5.7, 3.8, 1.7, 0.3, nan],
       [5.1, 3.8, 1.5, 0.3, nan],
       [5.4, 3.4, 1.7, 0.2, nan],
       [5.1, 3.7, 1.5, 0.4, nan],
       [4.6, 3.6, 1. , 0.2, nan],
       [5.1, 3.3, 1.7, 0.5, nan],
       [4.8, 3.4, 1.9, 0.2, nan],
       [5. , 3. , 1.6, 0.2, nan],
       [5. , 3.4, 1.6, 0.4, nan],
       [5.2, 3.5, 1.5, 0.2, nan],
       [5.2, 3.4, 1.4, 0.2, nan],
       [4.7, 3

In [140]:
# Force Python object datatypes
iris = np.genfromtxt('../data/iris.data', delimiter=',', dtype=object)
iris

array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa'],
       [b'4.6', b'3.1', b'1.5', b'0.2', b'Iris-setosa'],
       [b'5.0', b'3.6', b'1.4', b'0.2', b'Iris-setosa'],
       [b'5.4', b'3.9', b'1.7', b'0.4', b'Iris-setosa'],
       [b'4.6', b'3.4', b'1.4', b'0.3', b'Iris-setosa'],
       [b'5.0', b'3.4', b'1.5', b'0.2', b'Iris-setosa'],
       [b'4.4', b'2.9', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.1', b'1.5', b'0.1', b'Iris-setosa'],
       [b'5.4', b'3.7', b'1.5', b'0.2', b'Iris-setosa'],
       [b'4.8', b'3.4', b'1.6', b'0.2', b'Iris-setosa'],
       [b'4.8', b'3.0', b'1.4', b'0.1', b'Iris-setosa'],
       [b'4.3', b'3.0', b'1.1', b'0.1', b'Iris-setosa'],
       [b'5.8', b'4.0', b'1.2', b'0.2', b'Iris-setosa'],
       [b'5.7', b'4.4', b'1.5', b'0.4', b'Iris-setosa'],
       [b'5.4', b'3.9', b'1.3', b'0.4', b'Iris-setosa'],
       [b'5.1', b'3.5', b'1.4',

In the next part, we will show how pandas makes this much better.
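
Staying within numpy for a moment, one possible workaround is to load the numeric and text columns separately with `usecols`. The sketch below uses an in-memory copy of the first three lines shown above instead of the real file, so it runs without `../data/iris.data`.

```python
import io
import numpy as np

# First three lines of the dataset, as printed earlier
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Numeric columns only: a clean float array with no nan column
measurements = np.genfromtxt(io.StringIO(sample), delimiter=",",
                             usecols=(0, 1, 2, 3))

# Text column on its own, read as strings
species = np.genfromtxt(io.StringIO(sample), delimiter=",",
                        usecols=4, dtype=str)

print(measurements.shape, measurements.dtype)
print(species)
```

This works, but you now have two arrays to keep in sync, which is part of the limitation this section is illustrating.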

# Numpy as glue
* Numpy is the common interface between different extension libraries.
* For example, if you have C code that takes arrays, you can construct numpy arrays and pass pointers to their data to your C functions.
* It is the common language of scipy, pandas, and so on...
* Many code optimization frameworks (like cython) also integrate with it.
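
As a rough sketch of the pointer-passing idea, the standard-library `ctypes` module can obtain a C pointer to an array's data. No real C library is called here; writing through the pointer from Python stands in for what a C function would do.

```python
import ctypes
import numpy as np

a = np.arange(5, dtype=np.float64)

# Pointer to the first element, typed as C `double *`
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# Writing through the pointer changes the array: no copy was made
ptr[0] = 42.0

# Going the other way: wrap an existing C pointer as an array without copying
wrapped = np.ctypeslib.as_array(ptr, shape=(5,))
print(a[0], np.shares_memory(a, wrapped))
```

The array must stay alive for as long as the pointer is in use, since the pointer does not own the memory.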

# Exercises 03.2

* 1) Create a random square matrix.  Multiply it by itself 1000 times and time it (`%%timeit`). Do it again, but use the ufunc output argument to save memory copies.  Which is faster?

In [136]:
m = np.random.rand(100*100)
m.shape = 100,100
m = np.matrix(m)
m

matrix([[0.28067671, 0.86492039, 0.5973912 , ..., 0.95825485, 0.13739709,
         0.91516837],
        [0.80709206, 0.57639336, 0.41141094, ..., 0.05186933, 0.60220827,
         0.67651982],
        [0.02010334, 0.77026838, 0.99463797, ..., 0.01301657, 0.92505826,
         0.27197167],
        ...,
        [0.0803721 , 0.27101947, 0.33265028, ..., 0.33834412, 0.04641629,
         0.21948241],
        [0.44924789, 0.22071308, 0.07268721, ..., 0.40110738, 0.04305881,
         0.9174136 ],
        [0.51232263, 0.83637437, 0.10073697, ..., 0.39680906, 0.96005107,
         0.85182225]])

In [134]:
%%timeit
m2 = m
for i in range(1000):
    m2 = np.add(m2, m2)

8.38 ms ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [137]:
%%timeit
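m2 = m.copy()               # copy first: `m2 = m` only aliases m, so the in-place update would overwrite it
for i in range(1000):
    np.add(m2, m2, out=m2)  # write the result back into m2, no new allocation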

8.6 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


* 2) Take the iris array from above (the one with the strings) and convert the first four columns to a new float array.

In [155]:
iris2 = np.array(iris[:,:4], dtype=float)

* 3) Compute the mean of each column of the iris array (you need to check the numpy documentation... or hint: it works like the axiswise functions above)

In [156]:
np.mean(iris2, axis=0)

array([5.84333333, 3.054     , 3.75866667, 1.19866667])

* 4) Subtract the means from each column

In [157]:
iris2 - np.mean(iris2, axis=0)

array([[-7.43333333e-01,  4.46000000e-01, -2.35866667e+00,
        -9.98666667e-01],
       [-9.43333333e-01, -5.40000000e-02, -2.35866667e+00,
        -9.98666667e-01],
       [-1.14333333e+00,  1.46000000e-01, -2.45866667e+00,
        -9.98666667e-01],
       [-1.24333333e+00,  4.60000000e-02, -2.25866667e+00,
        -9.98666667e-01],
       [-8.43333333e-01,  5.46000000e-01, -2.35866667e+00,
        -9.98666667e-01],
       [-4.43333333e-01,  8.46000000e-01, -2.05866667e+00,
        -7.98666667e-01],
       [-1.24333333e+00,  3.46000000e-01, -2.35866667e+00,
        -8.98666667e-01],
       [-8.43333333e-01,  3.46000000e-01, -2.25866667e+00,
        -9.98666667e-01],
       [-1.44333333e+00, -1.54000000e-01, -2.35866667e+00,
        -9.98666667e-01],
       [-9.43333333e-01,  4.60000000e-02, -2.25866667e+00,
        -1.09866667e+00],
       [-4.43333333e-01,  6.46000000e-01, -2.25866667e+00,
        -9.98666667e-01],
       [-1.04333333e+00,  3.46000000e-01, -2.15866667e+00,
      

* 5) Normalize each column to the range 0-1.  Don't process each column separately!

In [158]:
mins = np.min(iris2, axis=0)
maxs = np.max(iris2, axis=0)
range_ = maxs - mins
(iris2 - mins) / range_

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     ,

* 6) Advanced: Filter rows of iris2 that have `petallength` (3rd column) > 1.5 and `sepallength` (1st column) < 5.0  (hint: conditional selection; `&` is bitwise AND, so `True & True == True`, and in our case it can be used to combine boolean arrays for indexing)

In [170]:
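long_petal = iris2[:,2] > 1.5      # boolean mask: petal length > 1.5
short_sepal = iris2[:,0] < 5       # boolean mask: sepal length < 5.0
where = long_petal & short_sepal   # elementwise AND of the two masks
iris2[where]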

array([[4.8, 3.4, 1.6, 0.2],
       [4.8, 3.4, 1.9, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [4.9, 2.4, 3.3, 1. ],
       [4.9, 2.5, 4.5, 1.7]])