# Numpy
Numpy is the array package of Python. It isn't built in, but it is very well integrated into the Python ecosystem.

Numpy isn't matrices.  It isn't vectors.  It is `ndarray`s - n-dimensional arrays, so like most things in Python it is very flexible and can be used for almost anything.

An `ndarray` is not a matrix, although numpy provides a separate class for that (`np.matrix`).

Numpy arrays are good for basic arithmetic and linear algebra.  For structures better suited to typical data analysis, we will look at the `pandas` data structures next.

## Base numpy array properties
* N-dimensional
* Uniform data type (dtype)
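
As a minimal sketch, both properties can be inspected through the standard `ndim`, `shape`, and `dtype` attributes. The variable names `x` and `mixed` are only for illustration.

```python
import numpy as np

# A 3-dimensional array: 2 blocks of 3 rows and 4 columns
x = np.zeros((2, 3, 4))

print(x.ndim)   # number of dimensions (axes)
print(x.shape)  # length of each axis
print(x.dtype)  # one data type shared by every element

# Mixing ints and floats still gives a single dtype: everything is upcast
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```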

## Creating arrays

In [187]:
import numpy as np

In [188]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [189]:
np.ones((3,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [190]:
a = np.arange(12)
a.shape = 3, 4
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [191]:
np.array([1,2,3,4,5,6])

array([1, 2, 3, 4, 5, 6])

In [192]:
np.array([[9,8,7],[6,5,4],[3,2,1]])

array([[9, 8, 7],
       [6, 5, 4],
       [3, 2, 1]])

Different data types can be specified with `dtype=`.  This includes arbitrary C data types (refer to the docs to learn how to specify them).

In [193]:
np.array([[9,8,7],[6,5,4],[3,2,1]], dtype=float)

array([[9., 8., 7.],
       [6., 5., 4.],
       [3., 2., 1.]])

In [194]:
np.array([[9,8,7],[6,5,4],[3,2,1]], dtype=np.float32)

array([[9., 8., 7.],
       [6., 5., 4.],
       [3., 2., 1.]], dtype=float32)
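
An existing array can also be converted to another data type after it has been created. The sketch below uses `astype`, which returns a new array and leaves the original untouched. The names `ints` and `floats` are illustrative, and the default integer type printed at the end depends on the platform.

```python
import numpy as np

ints = np.array([[9, 8, 7], [6, 5, 4]])

# astype returns a converted copy; ints itself is unchanged
floats = ints.astype(np.float32)

print(floats.dtype, floats.itemsize)  # itemsize is bytes per element
print(ints.dtype)                     # platform-dependent integer type
```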

# Slicing and views
* Numpy allows you to **slice** arrays, that is, extract subsets.
* You can do this on any axis equivalently, and it is very powerful.
* Numpy by default tries to avoid copies: slices are views that point to the same physical memory.  Modifying a view modifies every array that references that memory.

In [195]:
a = np.arange(24)
a.shape = 6, 4
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [196]:
# Slice the first axis
a[0]

array([0, 1, 2, 3])

In [197]:
# Slice the second axis.  How do you tell the axes apart?
a[:,0]

array([ 0,  4,  8, 12, 16, 20])

In [198]:
# Remember the Python slicing syntax, start:stop:step?  The same applies here
# Example with Python lists:
l = list(range(10))
print(l)
print(l[5:])
print(l[:3])
print(l[::2])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5, 6, 7, 8, 9]
[0, 1, 2]
[0, 2, 4, 6, 8]

In [199]:
# With arrays, it is the same:
a[:2]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [200]:
a[:,:2]

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13],
       [16, 17],
       [20, 21]])

In [201]:
a[1:3,1:3]

array([[ 5,  6],
       [ 9, 10]])

Remember we said that slices are views into the same data?

In [202]:
a2 = a.copy()
a3 = a2[1:3]  # a3 is a view into a2
a3[:] = 0     # edit a3
a2            # a2 changes, too!

array([[ 0,  1,  2,  3],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
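
If it is unclear whether two arrays share data, `np.shares_memory` can be used to check. The following is a small sketch contrasting a slice (a view) with an explicit `.copy()`. The names `view` and `copy` are just for illustration.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)

view = a[1:3]          # basic slicing returns a view
copy = a[1:3].copy()   # an explicit copy owns its own memory

print(np.shares_memory(a, view))
print(np.shares_memory(a, copy))

view[:] = -1    # writes through to a
copy[:] = 100   # leaves a untouched
print(a[1:3])
```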

## Fancy slicing
You can do advanced slicing too, using boolean arrays.  A comparison operation on an array gives you a boolean array of `True` and `False` values.  If you use this to slice, then you get only the elements where the boolean array is `True`.

In [203]:
# Only one True value
a == 1

array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [204]:
a[a==1]

array([1])

In [205]:
# Note that the result is flattened to one dimension
a[a > 5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23])

In [206]:
# Every row, where the first element is a multiple of 3
a[a[:,0]%3 == 0]

array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

In [207]:
# Do the same for columns.  Now we start getting to a "trial and error" point!
columns_to_select = a[0]%3 == 0
a[:, columns_to_select]

array([[ 0,  3],
       [ 4,  7],
       [ 8, 11],
       [12, 15],
       [16, 19],
       [20, 23]])
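
Boolean arrays can also be combined, and they can appear on the left-hand side of an assignment. The sketch below shows both, assuming the same 6x4 array as above. The names `mask` and `b` are illustrative.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)

# Combine conditions elementwise with & (and), | (or), ~ (not).
# The parentheses are needed because & binds tighter than comparisons.
mask = (a > 5) & (a % 2 == 0)
print(a[mask])

# A boolean mask can also select the elements to overwrite in place
b = a.copy()
b[b > 20] = 0
print(b[-1])
```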

# Numpy documentation tour

Numpy has good documentation: 
* User guide: https://docs.scipy.org/doc/numpy/user/index.html
* Reference guide: https://docs.scipy.org/doc/numpy/reference/index.html

One thing which is often left out is "how to read documentation", so take a brief tour of these.

Being able to use the reference guides is very valuable...

# Exercise 03.1

* 1) Create an array of size 5x6 with sequential values.

* 2) How do you transpose the array?

* 3) Slice the top and bottom away.

* 4) Replace every other element with 0.

* 5) Create matrices and try multiplying them.  Create the same arrays and try multiplying.  What's the difference? (hint: `np.matrix`)

* 6) Bonus: try to understand the axis ordering system.  Can you find some logic to first and last axes, and how things are displayed?

* 7) Bonus: learn about advanced indexing and the `where` function.  Extract only multiples of three from your array.

# Basic array operations
* Of course, once you have arrays, you want to use them.

* The most basic type of function is the `ufunc`: "universal function"
  * They take arrays or scalars as input
  * They operate elementwise
  * They are optimized functions written in C and are generally clever about how they do things.

* What's the benefit of ufuncs?  They are universal: they work on any data type.

* These functions provide **vectorization**: one operation operates on a lot of data at once.
  * If you ever start doing `for` loops, think: can I use vectorized operations?  A lot of the time, the answer is yes.
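
As a rough sketch of what replacing a loop looks like, the example below computes `x**2 + 1` once with an explicit Python loop and once with vectorized operations. The array size and variable names are arbitrary choices for illustration.

```python
import numpy as np

x = np.arange(1000, dtype=float)

# Explicit Python loop: one element at a time
looped = np.empty_like(x)
for i in range(len(x)):
    looped[i] = x[i] ** 2 + 1

# Vectorized: the ufuncs behind ** and + loop over the data in C
vectorized = x ** 2 + 1

print(np.array_equal(looped, vectorized))
```

Both versions give the same values; the vectorized one is shorter and typically much faster on large arrays.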

In [208]:
a = np.arange(24)
a.shape = 6, 4
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [209]:
a + 10

array([[10, 11, 12, 13],
       [14, 15, 16, 17],
       [18, 19, 20, 21],
       [22, 23, 24, 25],
       [26, 27, 28, 29],
       [30, 31, 32, 33]])

In [210]:
np.add(a, 10)    # explicit function call - relevant for more advanced functions

array([[10, 11, 12, 13],
       [14, 15, 16, 17],
       [18, 19, 20, 21],
       [22, 23, 24, 25],
       [26, 27, 28, 29],
       [30, 31, 32, 33]])

It's obvious that this adds 10.  What about something else?

In [211]:
np.add(a, [10,20,30,40])

array([[10, 21, 32, 43],
       [14, 25, 36, 47],
       [18, 29, 40, 51],
       [22, 33, 44, 55],
       [26, 37, 48, 59],
       [30, 41, 52, 63]])

In [212]:
np.add(a, [[10],[20],[30],[40],[50],[60]])

array([[10, 11, 12, 13],
       [24, 25, 26, 27],
       [38, 39, 40, 41],
       [52, 53, 54, 55],
       [66, 67, 68, 69],
       [80, 81, 82, 83]])
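
The two results above come from **broadcasting**: shapes are compared starting from the last axis, and an axis of length 1 (or a missing axis) is stretched to match. The sketch below spells out the shapes involved. `row` and `col` are illustrative names.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)

row = np.array([10, 20, 30, 40])          # shape (4,)
col = np.array([10, 20, 30, 40, 50, 60])  # shape (6,)

# (6, 4) with (4,): the last axes match, so `row` is added to every row
print((a + row).shape)

# (6, 4) with (6,) does not match on the last axis, so give `col`
# an explicit trailing axis of length 1: shape (6, 1)
print((a + col[:, np.newaxis]).shape)

# Without the extra axis the shapes are incompatible
try:
    a + col
except ValueError as err:
    print("broadcast error:", err)
```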

Ufuncs can take an output argument.  This avoids extra copying of data!  This helps when you really want to avoid allocating temporary arrays, for example with large data.

In [213]:
b = a.copy()
np.multiply(a, a, out=b)   # write the result into b instead of a new array
print(b)

[[  0   1   4   9]
 [ 16  25  36  49]
 [ 64  81 100 121]
 [144 169 196 225]
 [256 289 324 361]
 [400 441 484 529]]
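
The output argument can also be the input array itself, which updates it in place. The sketch below checks that the data buffer stays at the same address. The name `buffer_address` is only for illustration.

```python
import numpy as np

a = np.arange(6, dtype=float)
buffer_address = a.ctypes.data   # memory address of a's data buffer

np.multiply(a, 2, out=a)         # result is written into a itself
np.add(a, 1, out=a)

print(a)
print(a.ctypes.data == buffer_address)
```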


Note about operations:

* Numpy, by default, always uses element-by-element operations.  For example, multiplying two 2-D arrays with `*` multiplies matching elements, whereas Matlab by default does matrix multiplication.
* Why is this?  Consistency: all operations work the same no matter what you have.  Numpy is designed so that if you have a function built for normal Python objects, it will often work for arrays as well, giving you automatic vectorization.  If you have a function that works for arrays, it will work for arrays of higher dimensions too via broadcasting.
* Explicit matrix multiplication operator: `a @ b`
* See also the `matrix` class and modules.
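
A small sketch of the difference between the elementwise product and the matrix product, using two arbitrary 2x2 arrays `m` and `n`:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])
n = np.array([[0, 1],
              [1, 0]])

print(m * n)   # elementwise product
print(m @ n)   # matrix product
```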

## Axiswise ufuncs
Ufunc-based reductions such as `np.sum` can be applied along a single axis!

In [214]:
np.sum(a, axis=0)

array([60, 66, 72, 78])

In [215]:
np.sum(a, axis=1)

array([ 6, 22, 38, 54, 70, 86])

Other functions like this: `sum`, `prod`, `diff`, `nansum`, etc.
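
As a sketch of how the `axis` argument relates to ufuncs, `np.sum` along an axis gives the same result as reducing the `add` ufunc along that axis. The `keepdims=True` option keeps the reduced axis with length 1, so the result broadcasts back against the original array. Variable names are illustrative.

```python
import numpy as np

a = np.arange(24).reshape(6, 4)

col_sums = a.sum(axis=0)                 # collapse axis 0 -> shape (4,)
row_sums = a.sum(axis=1, keepdims=True)  # keep reduced axis -> shape (6, 1)

# Summing along an axis matches reducing the `add` ufunc along it
print(np.array_equal(col_sums, np.add.reduce(a, axis=0)))

# keepdims lets the result broadcast against a: each row now sums to 1
fractions = a / row_sums
print(fractions.sum(axis=1))
```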

# Reading data: numpy limits
Here, we read a particular dataset called the "iris" dataset.

In [216]:
# What's in the dataset: we see five columns, the first four are numbers but the last is text
open("../data/iris.data").readlines()[:3]

['5.1,3.5,1.4,0.2,Iris-setosa\n',
 '4.9,3.0,1.4,0.2,Iris-setosa\n',
 '4.7,3.2,1.3,0.2,Iris-setosa\n']

In [217]:
iris = np.genfromtxt('../data/iris.data', delimiter=',',dtype=object)
iris

array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa'],
       [b'4.6', b'3.1', b'1.5', b'0.2', b'Iris-setosa'],
       [b'5.0', b'3.6', b'1.4', b'0.2', b'Iris-setosa'],
       [b'5.4', b'3.9', b'1.7', b'0.4', b'Iris-setosa'],
       [b'4.6', b'3.4', b'1.4', b'0.3', b'Iris-setosa'],
       [b'5.0', b'3.4', b'1.5', b'0.2', b'Iris-setosa'],
       [b'4.4', b'2.9', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.1', b'1.5', b'0.1', b'Iris-setosa'],
       [b'5.4', b'3.7', b'1.5', b'0.2', b'Iris-setosa'],
       [b'4.8', b'3.4', b'1.6', b'0.2', b'Iris-setosa'],
       [b'4.8', b'3.0', b'1.4', b'0.1', b'Iris-setosa'],
       [b'4.3', b'3.0', b'1.1', b'0.1', b'Iris-setosa'],
       [b'5.8', b'4.0', b'1.2', b'0.2', b'Iris-setosa'],
       [b'5.7', b'4.4', b'1.5', b'0.4', b'Iris-setosa'],
       [b'5.4', b'3.9', b'1.3', b'0.4', b'Iris-setosa'],
       ...

In [218]:
# Load with the default dtype (float): the text column cannot be parsed and becomes nan
iris = np.genfromtxt('../data/iris.data', delimiter=',')
iris

array([[5.1, 3.5, 1.4, 0.2, nan],
       [4.9, 3. , 1.4, 0.2, nan],
       [4.7, 3.2, 1.3, 0.2, nan],
       [4.6, 3.1, 1.5, 0.2, nan],
       [5. , 3.6, 1.4, 0.2, nan],
       [5.4, 3.9, 1.7, 0.4, nan],
       [4.6, 3.4, 1.4, 0.3, nan],
       [5. , 3.4, 1.5, 0.2, nan],
       [4.4, 2.9, 1.4, 0.2, nan],
       [4.9, 3.1, 1.5, 0.1, nan],
       [5.4, 3.7, 1.5, 0.2, nan],
       [4.8, 3.4, 1.6, 0.2, nan],
       [4.8, 3. , 1.4, 0.1, nan],
       [4.3, 3. , 1.1, 0.1, nan],
       [5.8, 4. , 1.2, 0.2, nan],
       [5.7, 4.4, 1.5, 0.4, nan],
       [5.4, 3.9, 1.3, 0.4, nan],
       [5.1, 3.5, 1.4, 0.3, nan],
       [5.7, 3.8, 1.7, 0.3, nan],
       [5.1, 3.8, 1.5, 0.3, nan],
       [5.4, 3.4, 1.7, 0.2, nan],
       [5.1, 3.7, 1.5, 0.4, nan],
       [4.6, 3.6, 1. , 0.2, nan],
       [5.1, 3.3, 1.7, 0.5, nan],
       [4.8, 3.4, 1.9, 0.2, nan],
       [5. , 3. , 1.6, 0.2, nan],
       [5. , 3.4, 1.6, 0.4, nan],
       [5.2, 3.5, 1.5, 0.2, nan],
       [5.2, 3.4, 1.4, 0.2, nan],
       ...

In the next part, we will show how pandas makes this much better.
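
Until then, one possible numpy-only workaround is to read the numeric and text columns separately with the `usecols` argument of `np.genfromtxt`. The sketch below inlines the first three lines of the file in a `StringIO` object so that it is self-contained. The names `sample`, `measurements`, and `species` are illustrative.

```python
import io
import numpy as np

# First three lines of the file, inlined so the example is self-contained
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Numeric columns only: a plain float array
measurements = np.genfromtxt(sample, delimiter=",", usecols=(0, 1, 2, 3))
print(measurements.shape, measurements.dtype)

# Text column only: rewind and read it as strings
sample.seek(0)
species = np.genfromtxt(sample, delimiter=",", usecols=4, dtype=str)
print(species)
```

This keeps each array uniform in type, at the cost of tracking two arrays by hand.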

# Numpy as glue
* Numpy is the common interface between different extension libraries
* For example, if you have C code that takes arrays, you can construct numpy arrays and pass the pointers to your C functions.
* Common language of scipy, pandas, and so on...
* Many other code optimization frameworks (like cython) also integrate.
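
As a hedged sketch of the pointer-passing idea, the standard-library `ctypes` module can obtain a C-style pointer to an array's data. No real C library is loaded here; writing through the pointer from Python stands in for what a C function would do.

```python
import ctypes
import numpy as np

a = np.arange(5, dtype=np.float64)   # C-contiguous buffer of doubles

# A C-style `double *` pointing at the array's data (no copy is made)
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# Stand-in for a C function receiving the pointer: write through it
ptr[2] = 99.0

print(a)   # the change is visible in the numpy array
```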

# Exercises 03.2

* 1) Take the iris array from above (the one with the strings) and convert the first four columns to a new float array.

* 2) Compute the mean of each column of the iris array (you need to check the numpy documentation... or hint: it works like the axiswise functions above).

* 3) Subtract the means from each column.

* 4) Normalize each column to the range 0-1.  Don't process each column separately!

* 5) Advanced: Filter rows of iris2 that have `petallength` (3rd column) > 1.5 and `sepallength` (1st column) < 5.0  (hint: conditional selection; also, `&` is bitwise AND, so `True & True == True`, which in our case can be used with the boolean array indexing).