# Numpy
Numpy is the array package of Python. It isn't built-in, but is very well integrated into the Python ecosystem.

Numpy isn't matrices.  It isn't vectors.  It is `ndarray`s - n-dimensional arrays, so like most things in Python it is very flexible and can be used for almost anything.

An `ndarray` is not a matrix, but there are other classes for that (see `np.matrix`).

Numpy arrays are good for basic arithmetic and linear algebra.  For a more usable version for typical data analysis, we will look at the `pandas` data structures next.

## Base numpy array properties
* N-dimensional
* Uniform data type (dtype)
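
A minimal sketch of these two properties. The array names (`grid`, `mixed`) are made up for illustration. It shows that every array reports its number of dimensions and its shape, and that all elements share one dtype, so mixing integers and floats converts everything to float.

```python
import numpy as np

# A 2-dimensional array built from a nested list
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.ndim)    # number of axes: 2
print(grid.shape)   # length of each axis: (2, 3)
print(grid.dtype)   # one integer dtype for all elements (exact width is platform-dependent)

# Mixing ints and floats: every element is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```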

## Creating arrays
There are many ways to create arrays, most importantly...
* `np.array`
* `np.asarray`
* Creator functions that return new data: `np.zeros`, `np.ones`, `np.arange`, `np.random.*`, etc.

In [1]:
import numpy as np

In [6]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [7]:
np.ones((3,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [10]:
np.arange(12)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [19]:
a = np.arange(12)
a.shape = (3, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [23]:
np.array([1,2,3,4,5,6])

array([1, 2, 3, 4, 5, 6])

In [25]:
np.array([[9,8,7],[6,5,4],[3,2,1]])

array([[9, 8, 7],
       [6, 5, 4],
       [3, 2, 1]])
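
The difference between `np.array` and `np.asarray` is easy to miss, so here is a small sketch (variable names are illustrative only). By default `np.array` copies its input, while `np.asarray` passes an existing array through without copying when no conversion is needed.

```python
import numpy as np

original = np.arange(5)

copied = np.array(original)      # copies the data by default
aliased = np.asarray(original)   # input is already an ndarray: no copy is made

aliased[0] = 99                  # writes into the memory of `original`
print(original[0])               # 99
print(copied[0])                 # 0, the copy is unaffected
print(np.shares_memory(original, aliased))  # True
```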

Different data types can be specified with `dtype=`.  This includes arbitrary C datatypes (refer to the docs to learn how to specify them).

In [26]:
np.array([[9,8,7],[6,5,4],[3,2,1]], dtype=float)

array([[9., 8., 7.],
       [6., 5., 4.],
       [3., 2., 1.]])

In [34]:
np.array(((9,8,7),(6,5,4),(3,2,1)), dtype=np.float32)

array([[9., 8., 7.],
       [6., 5., 4.],
       [3., 2., 1.]], dtype=float32)

# Slicing and views
* Numpy allows you to **slice** arrays, which basically means extracting subsets.
* You can do this on any axis equivalently, and it is very powerful.
* Numpy by default tries to avoid copies - different views point to the same physical memory location.  Modifying one array modifies everything that references it.
* It can take some time to fully understand it, especially the N-dimensional nature of arrays!

In [29]:
a = np.arange(24)
a.shape = [6, 4]
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [30]:
# Slice the first axis
a[0]

array([0, 1, 2, 3])

In [39]:
# Slice the second axis.  How do you tell the axes apart?
a[:,0]

array([ 0,  4,  8, 12, 16, 20])

In [47]:
# Remember the Python slicing syntax, start:stop:step?  Same applies here
# Example with Python lists:
l = list(range(10))
print(l)
print(l[5:])
print(l[:3])
print(l[::2])
print(l[-4:-2])
print(l[-2])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5, 6, 7, 8, 9]
[0, 1, 2]
[0, 2, 4, 6, 8]
[6, 7]
8


In [41]:
# With arrays, it is the same:
a[:2]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [42]:
a[:,:2]

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13],
       [16, 17],
       [20, 21]])

In [43]:
a[1:3,1:3]

array([[ 5,  6],
       [ 9, 10]])

Remember we said that slices are views into the same data, not copies?  Having full control of copying is critical to any kind of efficient programming!

In [44]:
a2 = a.copy() # Force a copy.  np.array also copies by default.
a3 = a2[1:3]  # a3 is a view into a2
a3[:] = 0     # edit a3
a2            # a2 changes, too!

array([[ 0,  1,  2,  3],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
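
If you are not sure whether two arrays are a view and its source or independent copies, `np.shares_memory` can tell you. A minimal sketch, with made-up names so it does not disturb the arrays used above:

```python
import numpy as np

base = np.arange(12).reshape(3, 4)

window = base[1:3]            # basic slice: a view, no data copied
snapshot = base[1:3].copy()   # explicit copy: independent data

print(np.shares_memory(base, window))     # True
print(np.shares_memory(base, snapshot))   # False

window[0, 0] = -1             # visible through `base`, not through `snapshot`
print(base[1, 0], snapshot[0, 0])         # -1 4
```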

## Advanced indexing
### Boolean indexing
You can do advanced slicing too, using boolean arrays.  Basically, doing comparison operations on an array gives you a boolean array of trues and falses.  If you use this to index, it acts as a mask that selects the elements where it is true.

This is used a lot for selecting data.

(Boolean indexing isn't the only method of advanced indexing)

Advanced: how does this relate to memory management?

In [49]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [52]:
# A comparison gives a boolean array of the same shape
a > 5

array([[False, False, False, False],
       [False, False,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [54]:
# Only one True value in the mask, so one element is selected
a[a==1]

array([1])

In [53]:
# Note that the result is flattened to one dimension
a[a > 5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23])

In [55]:
# Every row, where the first element is a multiple of 3
a[a[:,0]%3 == 0]

array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

In [56]:
# Do the same for columns.  Now we start getting to a "trial and error" point!
columns_to_select = a[0]%3 == 0
a[:, columns_to_select]

array([[ 0,  3],
       [ 4,  7],
       [ 8, 11],
       [12, 15],
       [16, 19],
       [20, 23]])
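
On the memory management question: unlike basic slices, advanced indexing (boolean or integer arrays) returns a copy, while *assigning* through an advanced index still modifies the original in place. A small sketch with illustrative names; it also shows indexing with a list of integers, the other common kind of advanced indexing.

```python
import numpy as np

data = np.arange(24).reshape(6, 4)

# Reading with a boolean mask gives a copy
picked = data[data > 20]
picked[:] = -1
print(data.max())              # 23: the original is untouched

# Assigning through a mask modifies the original in place
data[data > 20] = 0
print(data[5])                 # [20  0  0  0]

# Integer-array indexing: pick rows 0, 2 and 5 (also a copy)
rows = data[[0, 2, 5]]
print(rows.shape)                       # (3, 4)
print(np.shares_memory(data, rows))     # False
```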

In [57]:
np.array?

Docstring:
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

Create an array.

Parameters
----------
object : array_like
    An array, any object exposing the array interface, an object whose
    __array__ method returns an array, or any (nested) sequence.
dtype : data-type, optional
    The desired data-type for the array.  If not given, then the type will
    be determined as the minimum type required to hold the objects in the
    sequence.  This argument can only be used to 'upcast' the array.  For
    downcasting, use the .astype(t) method.
copy : bool, optional
    If true (default), then the object is copied.  Otherwise, a copy will
    only be made if __array__ returns a copy, if obj is a nested sequence,
    or if a copy is needed to satisfy any of the other requirements
    (`dtype`, `order`, etc.).
order : {'K', 'A', 'C', 'F'}, optional
    Specify the memory layout of the array. If object is not an array, the
    newly created array will be i

# Numpy documentation tour

Numpy has good documentation: 
* User guide: https://docs.scipy.org/doc/numpy/user/index.html
* Reference guide: https://docs.scipy.org/doc/numpy/reference/index.html

One thing which is often left out is "how to read documentation", so take a brief tour of these.

Being able to use the reference guides is very valuable...

# Exercise 03.1

* 1) Create array of size 5x6 with sequential values (`np.arange`)

In [71]:
b = np.arange(5*6)
b.shape = 5, 6
b

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

* 2) How do you transpose the array?

In [72]:
b.T

array([[ 0,  6, 12, 18, 24],
       [ 1,  7, 13, 19, 25],
       [ 2,  8, 14, 20, 26],
       [ 3,  9, 15, 21, 27],
       [ 4, 10, 16, 22, 28],
       [ 5, 11, 17, 23, 29]])

- 3) Slice the top and bottom rows away.

In [73]:
b[1:-1]

array([[ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

- 4) Replace every other element with 0.  Replace every third element with 1.

In [74]:
b[:,1::2] = 0
b[:,2::3] = 1
b

array([[ 0,  0,  1,  0,  4,  1],
       [ 6,  0,  1,  0, 10,  1],
       [12,  0,  1,  0, 16,  1],
       [18,  0,  1,  0, 22,  1],
       [24,  0,  1,  0, 28,  1]])

- 5) Create matrices and try multiplying them.  Create the same arrays and try multiplying.  What's the difference? (hint: `np.matrix`)

In [78]:
c = np.array([[1,2], [3,4]])
d = np.array([[5,6], [7,8]])
print(c)
print(d)
np.dot(c, d)


[[1 2]
 [3 4]]
[[5 6]
 [7 8]]


array([[19, 22],
       [43, 50]])

- 6) Bonus: try to understand the axis ordering system.  Can you find some logic to first and last axes, and how things are displayed?

- 7) Bonus: learn about advanced indexing and the `where` function.  Extract only multiples of three from your array.
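
One possible approach to the `where` bonus, as a sketch rather than the only answer. It uses a fresh 5x6 array (`values`, a made-up name) instead of the modified `b` from above.

```python
import numpy as np

values = np.arange(30).reshape(5, 6)
mask = values % 3 == 0

# Boolean indexing extracts the matching elements as a 1-D array
multiples = values[mask]
print(multiples)

# np.where with one argument returns the indices where the mask is True
row_idx, col_idx = np.where(mask)

# np.where with three arguments keeps the shape: take `values` where the
# mask is True, and -1 elsewhere
replaced = np.where(mask, values, -1)
print(replaced)
```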

# Basic array operations
* Of course, once you have arrays, you want to use them.

* The most basic type of function is `ufuncs`: "universal functions"
  * They take arrays or scalars as input
  * They operate elementwise
  * They are optimized functions written in C and are generally clever about how they do things.
  
* What's the benefit of ufuncs?  They are universal: they work on any data type.

* They engage in something called **broadcasting**: if the shapes of the inputs don't match, axes of length 1 (and missing leading axes) are stretched to fit, when that is possible.

* They exist as functions such as `np.add`, but the normal operators such as `+` work too, where they exist.

* These functions provide **vectorization**: one operation operates on a lot of data at once.
  * If you ever start doing `for` loops, think: can I use vectorized operations?  A lot of the time, the answer is yes.
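
To make the vectorization point concrete, here is a sketch that computes the same thing with an explicit `for` loop and with one vectorized expression. The names are illustrative; timing is left out on purpose, try `%timeit` on both yourself.

```python
import numpy as np

samples = np.arange(1000, dtype=float)

# Explicit Python loop: one element at a time
looped = np.empty_like(samples)
for i in range(samples.size):
    looped[i] = samples[i] ** 2 + 1

# Vectorized: the ufuncs behind ** and + run over the whole array
vectorized = samples ** 2 + 1

print(np.array_equal(looped, vectorized))   # True
```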

In [None]:
a = np.arange(24)
a.shape = 6, 4
a

In [None]:
a + 10

In [None]:
np.add(a, 10)    # explicit function call - relevant for more advanced functions

It's obvious that this adds 10.  What about something else?

In [None]:
np.add(a, [10,20,30,40])

In [None]:
np.add(a, [[10],[20],[30],[40],[50],[60]])
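
What happened in the last two cells is broadcasting. The sketch below (names are made up) spells out the shapes involved: a `(4,)` array is stretched across the rows, a `(6, 1)` array is stretched across the columns, and shapes that cannot be matched raise an error.

```python
import numpy as np

table = np.arange(24).reshape(6, 4)          # shape (6, 4)
row = np.array([10, 20, 30, 40])             # shape (4,): added to every row
col = np.arange(10, 70, 10).reshape(6, 1)    # shape (6, 1): added to every column

print((table + row).shape)   # (6, 4)
print((table + col).shape)   # (6, 4)
print((row + col).shape)     # (6, 4): both inputs get stretched

# Shapes (6, 4) and (3,) cannot be matched up
failed = False
try:
    table + np.array([1, 2, 3])
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)
```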

Ufuncs can take an output argument.  This avoids extra copying of data!  It matters most when a computation would otherwise create many temporary copies.

In [None]:
b = a.copy()             # output array
np.multiply(a, a, out=b) # write a*a into b instead of allocating a new array
print(b)

In [None]:
np.multiply(b, 2, out=b) # double b in place
b

Note about operations:

* Numpy, by default, always uses element-by-element operations.  For example, try to multiply two matrices together with `*`: you get the elementwise product, whereas Matlab does matrix multiplication by default.
* Why is this?  Consistency: all operations work the same no matter what you have.  Numpy is designed so that if you have a function built for normal Python objects, it will work for arrays as well, giving you automatic vectorization.  If you have a function that works for arrays, it will work for arrays of higher dimensions too via broadcasting.
* Explicit matrix multiplication operator: `a @ b`
* See also the `matrix` class and modules.
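
A short sketch of the difference, reusing the same 2x2 values as the exercise above under new (illustrative) names:

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])
right = np.array([[5, 6], [7, 8]])

elementwise = left * right   # multiplies matching elements
matmul = left @ right        # matrix product, same as np.dot(left, right)

print(elementwise)   # [[ 5 12]
                     #  [21 32]]
print(matmul)        # [[19 22]
                     #  [43 50]]
```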

## Axiswise ufuncs
Ufuncs and related functions can also be applied along just one axis!

In [None]:
a

In [None]:
np.sum(a, axis=0)

In [None]:
np.sum(a, axis=1)

Other functions like this: `sum`, `prod`, `diff`, `nansum`, etc.
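
The cells above have no saved output, so here is a sketch of what the `axis` argument does on the same 6x4 layout (under an illustrative name). `axis=0` collapses the rows and leaves one value per column; `axis=1` collapses the columns and leaves one value per row. `keepdims=True` keeps the collapsed axis with length 1, which is handy for broadcasting afterwards.

```python
import numpy as np

table = np.arange(24).reshape(6, 4)

col_sums = table.sum(axis=0)   # one value per column, shape (4,)
row_sums = table.sum(axis=1)   # one value per row, shape (6,)
print(col_sums)                # [60 66 72 78]
print(row_sums)                # [ 6 22 38 54 70 86]

# np.sum along an axis is the `reduce` method of the np.add ufunc
print(np.array_equal(col_sums, np.add.reduce(table, axis=0)))   # True

# keepdims leaves a length-1 axis in place: shape (6, 1)
print(table.sum(axis=1, keepdims=True).shape)
```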

# Reading data: numpy limits
Here, we read a particular dataset called the "iris" dataset.  We don't do anything practical here, but use it as an example to show what works and doesn't.

In [None]:
# What's in the dataset: we see five columns, first four are numbers but last is text
open("../data/iris.data").readlines()[:3]

In [None]:
# Load with the default datatype (float): the text column cannot be parsed and becomes nan
iris = np.genfromtxt('../data/iris.data', delimiter=',')
iris

In [None]:
# Force Python object datatypes
iris = np.genfromtxt('../data/iris.data', delimiter=',', dtype=object)
iris
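
If you don't have the data file at hand, the same behaviour can be reproduced with a few inline rows. This is a sketch: the three rows below are just sample lines in the same five-column layout, fed through `io.StringIO` instead of a file. It also shows `usecols`, which reads only the numeric columns.

```python
import io
import numpy as np

# Sample rows in the iris.data layout: four numbers, then a text label
text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# Default dtype is float: the text column cannot be converted and becomes nan
as_float = np.genfromtxt(io.StringIO(text), delimiter=",")
print(as_float[:, 4])

# Read only the four numeric columns
numeric = np.genfromtxt(io.StringIO(text), delimiter=",", usecols=(0, 1, 2, 3))
print(numeric.shape)   # (3, 4)
```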

In the next part, we will show how pandas makes this much better.

# Numpy as a glue
* Numpy is the common interface between different extension libraries
* For example, if you have C code that takes arrays, you can construct numpy arrays and pass the pointers to your C functions.
* Common language of scipy, pandas, and so on...
* Many other code optimization frameworks (like cython) also integrate.
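
As a rough sketch of the "pass the pointers" idea, the standard library `ctypes` module can hand out a C pointer to an array's data. No actual C function is called here (that would need a compiled library); instead the example writes through the pointer from Python to show that it refers to the same memory.

```python
import ctypes
import numpy as np

# A contiguous float64 array: the layout a C function taking `double *` would expect
buffer = np.arange(5, dtype=np.float64)

# Pointer to the first element.  No data is copied.
pointer = buffer.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# Writing through the pointer (as C code would) changes the numpy array
pointer[2] = 100.0
print(buffer)   # [  0.   1. 100.   3.   4.]
```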

# Exercises 03.2

* 1) Create a random square matrix.  Multiply it by itself 1000 times and time it (`%%timeit`). Do it again, but use ufuncs to save memory copies.  Which is faster?

* 2) Take the iris array from above (the one with the strings) and convert the first four columns to a new float array.

- 3) Compute the mean of each column of the iris array (you need to check the numpy documentation... or hint, they are ufuncs)

- 4) Subtract the means from each column

* 5) Normalize each column to the range 0-1.  Don't process each column separately!

* 6) Advanced: Filter rows of iris2 that have `petallength` (3rd column) > 1.5 and `sepallength` (1st column) < 5.0  (hint: conditional selection; also `&` is bitwise AND, so `True & True` is `True`, which in our case can be used with the boolean array indexing)