# Python: numpy and matplotlib

This is a tutorial on scientific Python for the [KIPAC computing boot camp](http://kipac.github.io/BootCamp).

Authors: [Yao-Yuan Mao](http://yymao.github.io), [Joe DeRose](https://github.com/j-dr), [Mike Baumer](https://mbaumer.github.io)

In [None]:
from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display

## The Trouble with lists
Python lists are great, but they have significant limitations in the context of scientific computing.

In [None]:
a = [23,34,23,5]
a[1] #useful

In [None]:
a.append(3)
a #useful

In [None]:
2*a # not typically useful

In [None]:
a**2 #argh!

In [None]:
a*a #argh!

Dictionaries aren't much better for math:

In [None]:
b = {'entry1' : 34, 'entry2' : 45, 'entry3' : 23}

In [None]:
b*b

In [None]:
#also, lists are slow at mathematical calculations
def list_square(a):
    squared = []
    for entry in a:
        squared.append(entry**2)
    return squared

def array_square(a):
    return np.power(a,2)

In [None]:
big_list = 1000*[3]
big_array = np.array(big_list)
%timeit list_square(big_list)
%timeit array_square(big_array)

Numpy and matplotlib are the workhorse packages for scientists using Python. For most numerical work, efficient use of numpy is what makes a computation finish in a tractable amount of time. Once you've performed those operations, matplotlib is the go-to package for visualization.
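
For contrast with the list cells above, here is a minimal sketch of the same operations on a numpy array. The variable names are illustrative.

```python
import numpy as np

arr = np.array([23, 34, 23, 5])

doubled = 2 * arr      # element-wise multiplication, not repetition
squared = arr ** 2     # element-wise power; a list raises TypeError here
product = arr * arr    # element-wise product; also a TypeError for lists

print(doubled)
print(squared)
print(product)
```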

## Outline
* **Numpy**
    * Creating arrays  
    * Manipulating arrays  
    * Under the hood
    * Common array operations
    * Useful functions
* **Matplotlib**

## Numpy
Why use numpy instead of regular old python?

**python** has built-in:   
containers: lists (cheap append), dictionaries (fast lookup),...  
high-level number objects: integers, floating point


**numpy** is:   
an extension package to Python for multidimensional arrays   
closer to hardware (efficiency)   
designed for scientific computation (convenience)   


### Creating arrays
The fundamental object in numpy is the ndarray. There are many different ways to create an array, as outlined below.

In [None]:
#an uninitialized array: np.ndarray allocates memory without filling it, so the contents are arbitrary (pass in the shape of the array you want)
shape = (10,10)
z = np.ndarray(shape)
print('The shape of this array is {0}'.format(z.shape))
display(z)

In [None]:
#an array of zeros
z = np.zeros(shape)
display(z)

In [None]:
#A range of integers
a = np.arange(100)
display(a)

In [None]:
#An array with numbers at linearly spaced intervals
l = np.linspace(0,5,100)
display(l)

In [None]:
#An array with numbers at log spaced intervals
l = np.logspace(1,5,100)
display(l)
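
Note that `np.logspace` takes the exponents of the endpoints, not the endpoints themselves. A short sketch, with illustrative variable names, comparing it with `np.linspace`:

```python
import numpy as np

lin = np.linspace(0, 5, 6)   # 6 points from 0 to 5, both endpoints included
log = np.logspace(1, 5, 5)   # 5 points from 10**1 to 10**5

print(lin)   # evenly spaced values
print(log)   # evenly spaced exponents

# logspace(a, b, n) matches 10 ** linspace(a, b, n)
print(np.allclose(log, 10 ** np.linspace(1, 5, 5)))
```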

In [None]:
#1-d array creation from a list
l = list(range(100)) #in Python 3, range() is not a list until converted
print(type(l))
print(l)
la = np.array(l)
print(type(la))
display(la)

In [None]:
#2-d array creation from a list
l = [[0,1,2], [3,4,5]]
la = np.array(l)
display(la)
print(la.shape)

In [None]:
#3-d array
c = np.array([[[1], [2]], [[3], [4]]])
display(c)
print(c.shape)

### Data types

The type of the elements in an ndarray is specified in the array's dtype attribute. Numpy arrays can contain more than one type of data, but the types must be specified via the dtype!

In [None]:
#these are different arrays!
a = np.array([1, 2, 3])
print('This array has type {0}'.format(a.dtype))


b = np.array([1., 2., 3.])
print('This array has type {0}'.format(b.dtype))


The type can be specified explicitly:

In [None]:
a = np.array([0.,1.,2.], dtype=int)
print(a.dtype)

#default dtype is float
a = np.zeros(2)
print(a.dtype)
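
Because every element shares the array's dtype, values are converted on assignment rather than changing the array's type. A small sketch (names are illustrative) of the silent truncation this can cause with integer arrays:

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 7.9                  # the float is truncated to fit the integer dtype
print(ints, ints.dtype)

# astype returns a converted copy and leaves the original untouched
floats = ints.astype(float)
floats[0] = 7.9
print(floats, floats.dtype)
```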

You can have more than one type in an array, but it must be done in a particular fashion, and the type must have a predictable size in memory.

In [None]:
#can't do this
z = np.ones(3)
print(z.dtype)
z[0] = 1.
z[1] = 1
z[2] = 'h'
display(z)

In [None]:
#but can do it using more complex dtypes
dt = np.dtype([('c1', float), ('c2', int), ('c3', 'S140')]) #np.float and np.int were removed from recent numpy; the builtins are equivalent
z = np.ones(3, dtype=dt)
print(z.dtype)
print(z.shape)
display(z)

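This type of array is called a structured array (numpy's record arrays, `np.recarray`, are a closely related subclass that also allows attribute access to fields). A few things to note:

* It is still considered a 1-dimensional array, even though it has more than 1 column.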
* Accessing elements of a record array is a bit different.
* We are required to name the different fields and define their datatypes.

In [None]:
#check the names of the fields
z.dtype.names

In [None]:
#accessing all rows of a particular field yields a normal array
print(z['c1'])
print(z['c1'].dtype)

In [None]:
#accessing a row
print(z[0])
print(z[0].dtype)

The size of each field must be predetermined. We can't put strings longer than the length that was specified in the dtype; longer strings are silently truncated!

In [None]:
z['c3'][0] = 150*'x'
len(z['c3'][0]) #whoops!
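
The same truncation can be seen on a smaller scale. This sketch uses a hypothetical two-field dtype (`label`, `value`) with a 5-byte string field:

```python
import numpy as np

dt = np.dtype([('label', 'S5'), ('value', float)])
rec = np.zeros(2, dtype=dt)

rec['label'][0] = b'abcdefgh'   # 8 bytes into a 5-byte field
rec['value'][0] = 1.5

print(rec[0])                   # the label was cut to fit
print(rec.dtype.itemsize)       # bytes per row: 5 for the string + 8 for the float
```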

### Exercise 1

Create an array that can store strings of length at most 200 as well as booleans. 

### Indexing and slicing
Numpy arrays support all the same indexing and slicing operations that standard python lists do. By default, numpy arrays are C ordered (row-major): the first index selects the row and the second the column, and the elements of a row are stored next to each other in memory.

In [None]:
a = np.arange(100)
#numpy arrays are zero indexed
a[0], a[4], a[10]

In [None]:
#slicing
a[0:10:2] #syntax is same as for lists start:stop:step

In [None]:
a.shape

In [None]:
#indices of multi dimensional arrays are tuples of integers
x = a.reshape(10,10)
display(x)
print(x[1,0])
print(x[0,1])
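
Slices can be combined with integer indices along each axis. A brief sketch on a smaller array (the name `grid` is illustrative):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

print(grid[1, :])      # second row
print(grid[:, 2])      # third column
print(grid[:2, 1:3])   # first two rows, columns 1 and 2
print(grid[-1, ::2])   # last row, every other column
```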

### Exercise 2
Create an 8x8 matrix and fill it with a checkerboard pattern.

### Fancy Indexing
In addition to the standard list indexing and slicing operations, we can also use arrays and lists of integers or bools to index numpy arrays.

In [None]:
#you can also use lists or other numpy arrays to index
x = np.arange(200)
x[[10,20,45]]

In [None]:
#when indexing with arrays of ints, the output takes the shape of the index
a = np.arange(10)
idx = np.array([[3, 4], [9, 7]])
print(a[idx])
print(a[idx].shape)

In [None]:
#we can also use boolean masks to index arrays
np.random.seed(3)
a = np.random.randint(0, 20, 15)
print(a)
print(a % 3 == 0)
mask = (a % 3 == 0)
extract_from_a = a[mask] # or,  a[a%3==0]
extract_from_a 
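
Boolean masks can be combined with `&`, `|` and `~` (with parentheses around each comparison), and they can also appear on the left-hand side of an assignment. A minimal sketch with made-up values:

```python
import numpy as np

vals = np.array([3, 8, 12, 5, 18, 7])

# combine conditions element-wise; `and`/`or` do not work on arrays
mask = (vals % 3 == 0) & (vals > 5)
selected = vals[mask]
print(selected)

# assignment through a mask modifies the array in place
vals[~mask] = 0
print(vals)
```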

But fancy indexing is much slower than slicing. We will see why in a moment.

In [None]:
x = np.arange(10000)

In [None]:
%timeit x[range(10000)[::100]]
%timeit x[::100]

### Exercise 3
Create a 10 × 5 array with random numbers between 0 and 1. Construct the (one-dimensional) array returning the values of the array closest to 0.66 for each row using fancy indexing. Do the same for the columns.

**Hint**: You'll need to use the functions np.argmin and np.abs. Look up how they work!

### Under the hood

In order to understand why certain operations are more efficient than others in numpy, we need to understand views and copying of arrays. Below we perform two different operations that do the same thing, one much slower than the other. 

It is also very important to know what operations produce views in order to avoid hard to find bugs!

In [None]:
from IPython.display import Image
Image(filename='ndarray_layout.png')

In [None]:
x.data

In [None]:
print(x.strides)
print(x.dtype)

In [None]:
x = np.zeros(100, dtype=np.dtype([('a', float), ('b', 'S100')]))

In [None]:
print(x.strides)
print(x.dtype)
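
The strides give the number of bytes to step in memory to move one element along each axis. A sketch with an explicit 8-byte integer dtype, so the numbers do not depend on the platform:

```python
import numpy as np

m = np.arange(12, dtype=np.int64).reshape(3, 4)

print(m.strides)           # (32, 8): next row is 4 * 8 bytes away, next column 8 bytes
print(m.T.strides)         # (8, 32): the transpose only swaps the strides
print(m[:, ::2].strides)   # (32, 16): a step of 2 doubles the column stride
```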

Depending on how we index an array, we get either a new array created by copying the selected data, or a 'view' that shares memory with the original. Basic slicing returns a view; fancy indexing returns a copy.

In [None]:
x = np.arange(10000)

In [None]:
fa = x[range(10000)[::100]]
sa = x[::100]

In [None]:
fa.base is x

In [None]:
#If an array does not own its own data, we say it is a view of an array
sa.base is x

In [None]:
#for instance, transposes of arrays are actually views
sa.T.flags

In [None]:
#if an array owns its own data, its base will be None
print(fa.base)
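
One practical consequence is that writing to a view changes the original array, while writing to a copy does not. A minimal sketch (variable names are illustrative; `np.shares_memory` is used only as a check):

```python
import numpy as np

orig = np.arange(10)

view = orig[::2]          # basic slicing: a view
copy = orig[[0, 2, 4]]    # fancy indexing: a copy

view[0] = -1              # also changes orig[0]
copy[1] = 99              # orig is unaffected

print(orig)
print(np.shares_memory(orig, view), np.shares_memory(orig, copy))
```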

In [None]:
a = np.arange(100).reshape(10,10)

### Exercise 4
Do the following two operations result in views or copies?

In [None]:
at = a.T
af = a.T.ravel()

### Array operations

### broadcasting
Broadcasting is a way of performing operations on numpy arrays of different shapes. 

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

* they are equal, or
* one of them is 1

If these conditions are not met, a ```ValueError``` is raised.
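
The rule can be checked directly on array shapes. A sketch with zero-filled arrays, where only the shapes matter (the name `base` is illustrative):

```python
import numpy as np

base = np.zeros((5, 8))

print((base + np.zeros(8)).shape)        # (5, 8): trailing dimensions match
print((base + np.zeros((5, 1))).shape)   # (5, 8): the length-1 axis is stretched

try:
    base + np.zeros(5)                   # trailing dimensions 8 and 5 differ
except ValueError as err:
    print('ValueError:', err)
```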

In [None]:
a = np.arange(40).reshape(5, 8)
display(a)

In [None]:
a = a + 1

In [None]:
display(a)

In [None]:
a += 1
display(a)

In [None]:
b = np.arange(-1, -9, -1)
display(b)

In [None]:
a + b

In [None]:
c = np.array([10,20,30,40,50])
display(c)

In [None]:
a + c  # this would raise an error

In [None]:
a += c[:,np.newaxis]

In [None]:
a

### reduce

In [None]:
a.sum()

In [None]:
a.mean()

In [None]:
a.sum(axis=1)

In [None]:
a.max(axis=0)

In [None]:
np.median(a, axis=1)

In [None]:
np.std(a, axis=0)
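
Reductions combine naturally with broadcasting. Passing `keepdims=True` keeps the reduced axis with length 1, so the result can be broadcast back against the original array. A short sketch, with an illustrative array `m`:

```python
import numpy as np

m = np.arange(12, dtype=float).reshape(3, 4)

row_means = m.mean(axis=1, keepdims=True)   # shape (3, 1) instead of (3,)
centered = m - row_means                    # subtract each row's mean

print(row_means.shape)
print(centered.mean(axis=1))                # zero for every row
```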

The first rule of numpy: **AVOID LARGE FOR LOOPS** (instead, use [ufuncs](http://docs.scipy.org/doc/numpy/reference/ufuncs.html))

In [None]:
a = np.arange(1000)
%timeit a**2
%timeit [i**2 for i in a] #these yield identical answers
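
As a sketch of what replacing a loop looks like in practice, the two functions below (hypothetical names) compute the distance of points from the origin, first with a Python loop and then with array operations:

```python
import numpy as np

def radii_loop(xs, ys):
    # one Python-level iteration per point
    out = []
    for x, y in zip(xs, ys):
        out.append((x**2 + y**2) ** 0.5)
    return out

def radii_vectorized(xs, ys):
    # the same arithmetic, applied to whole arrays at once
    return np.sqrt(xs**2 + ys**2)

xs = np.array([3.0, 5.0, 8.0])
ys = np.array([4.0, 12.0, 15.0])
print(radii_loop(xs, ys))
print(radii_vectorized(xs, ys))
```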

### searching and sorting

In [None]:
a = np.array([[4, 3, 5], [1, 2, 1]]) 
display(a)
b = np.sort(a, axis=1) #sorts within each row independently (along axis 1); to sort row entries together use pandas
b

In [None]:
#a structured dtype lets us sort by a named field
dt = np.dtype([('c1', float), ('c2', float), ('c3', float)])
z = np.random.rand(3,3)
z = z.view(dt) #reinterpret each row of three floats as one record; the shape becomes (3,1)
z

In [None]:
indices = np.argsort(z,axis=0,order='c1')
z[indices,:]

In [None]:
indices = np.argsort(z,axis=0,order='c2')
z[indices,:] # again, pandas is easier here.

In [None]:
#once sorted, can use fast search algorithms
x = np.random.randint(0, high=100, size=1000)
x.sort()
i = x.searchsorted(27)
print(i)
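
`searchsorted` returns the index at which a value would have to be inserted to keep the array sorted. A small sketch with hand-picked values; the `side` argument selects which end of a run of equal values is returned:

```python
import numpy as np

s = np.array([1, 3, 3, 7, 9])

left = s.searchsorted(3)                  # first position where 3 could go
right = s.searchsorted(3, side='right')   # position just past the last 3

print(left, right)
print(right - left)                       # number of entries equal to 3
```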

### Exercise 5
Write a function which sorts a 2d array by its nth column.

### useful functions

In [None]:
a = np.arange(15).reshape(5,3)
b = np.arange(12).reshape(3,4)

display(a)
display(b)

In [None]:
np.dot(a, b)
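
For 2-d arrays, `np.dot` is ordinary matrix multiplication, which can also be written with the `@` operator (Python 3.5 and later). A sketch that checks the shape and one entry by hand:

```python
import numpy as np

a = np.arange(15).reshape(5, 3)
b = np.arange(12).reshape(3, 4)

prod = a @ b        # same result as np.dot(a, b)
print(prod.shape)   # (5, 4): the inner dimensions (3 and 3) must agree

# entry [i, j] is the sum over k of a[i, k] * b[k, j]
by_hand = sum(a[1, k] * b[k, 2] for k in range(3))
print(prod[1, 2] == by_hand)
```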

In [None]:
a = np.arange(30).reshape(5,3,2)
b = np.arange(60).reshape(3,4,5)

display(a)
display(b)

In [None]:
np.einsum('ijk,jli', a, b)
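
In the call above, the repeated indices `i` and `j` are summed over and the remaining indices `k` and `l` label the output, in alphabetical order because no output was specified. A sketch spelling this out and cross-checking it against `np.tensordot`:

```python
import numpy as np

a = np.arange(30).reshape(5, 3, 2)   # axes (i, j, k)
b = np.arange(60).reshape(3, 4, 5)   # axes (j, l, i)

implicit = np.einsum('ijk,jli', a, b)
explicit = np.einsum('ijk,jli->kl', a, b)   # same contraction, output written out

# contract a's axes (0, 1) with b's axes (2, 0)
check = np.tensordot(a, b, axes=([0, 1], [2, 0]))

print(implicit.shape)
print(np.array_equal(implicit, explicit), np.array_equal(implicit, check))
```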

In [None]:
# You can find most mathematical functions in numpy.
# For special functions, find them in `scipy.special`

x = np.linspace(0, np.pi*2, 101)
plt.plot(x, np.cos(x))
plt.plot(x, np.sin(x))

In [None]:
# dealing with bool array

a = np.random.randint(2, size=20).astype(bool)
display(a)
display(np.count_nonzero(a))
display(np.where(a))

In [None]:
# argsort

a = np.random.rand(10)

display(a)
display(a.argsort())

display(a[a.argsort()])

# note: to sort "in place", just do a.sort()

In [None]:
# argmax and unravel_index

a = np.random.rand(100)
display(a.argmax())

a = np.random.rand(100, 100)
display(a.argmax())
display(np.unravel_index(a.argmax(), a.shape))

In [None]:
# different ways to do histogram

a = np.random.rand(500)
bins = np.linspace(0, 1, 21)

display(np.bincount(np.searchsorted(bins, a)))

display(np.histogram(a, bins))

display(np.searchsorted(a, bins, sorter=a.argsort()))
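
`np.histogram` returns both the counts and the bin edges. A sketch (the fixed seed and the names are illustrative) that unpacks them and computes the bin centers, which is what you would typically plot against:

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
samples = rng.rand(500)
edges = np.linspace(0, 1, 21)

counts, returned_edges = np.histogram(samples, edges)
centers = 0.5 * (edges[:-1] + edges[1:])

print(counts.sum())              # every sample falls in some bin
print(len(counts), len(edges))   # n bins need n + 1 edges
print(centers[:3])
```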

### Exercise 6

Given a 2-D int array, fill the 0th row and 0th column with 1, and fill the rest by the following rule:

    a[i,j] = a[i-1,j] + a[i,j-1]

In [None]:
# run this cell to see hints

hint = 'Pqv|{B\x125(_wzs(wv(wvm(zw\x7f(i|(i(|qum6(\x125(aw}/tt(vmml(i(v}ux\x81(n}vk|qwv6(Vw|({}zm(\x7fpqkp(wvmG(\\z\x81(z}v(hvx6twwsnwz0/k}u}ti|q~m/1h6'
# shift every character back by 8 to decode the hint (works in Python 2 and 3)
print(''.join(chr(ord(ch) - 8) for ch in hint))

In [None]:
np.lookfor('cumulative')

## matplotlib

The easiest way to learn is to look at the [gallery](http://matplotlib.org/gallery.html)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=2)
ax1, ax2, ax3, ax4 = axes.flat

# scatter plot (Note: `plt.scatter` doesn't use default colors)
x, y = np.random.normal(size=(2, 200))
ax1.plot(x, y, 'o')

# sinusoidal lines with colors from default color cycle
L = 2*np.pi
x = np.linspace(0, L)
# 'axes.color_cycle' was removed from newer matplotlib; read the colors from 'axes.prop_cycle'
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
ncolors = len(colors)
shift = np.linspace(0, L, ncolors, endpoint=False)
for s in shift:
    ax2.plot(x, np.sin(x + s), '-')
ax2.margins(0)

# bar graphs
x = np.arange(5)
y1, y2 = np.random.randint(1, 25, size=(2, 5))
width = 0.25
ax3.bar(x, y1, width)
ax3.bar(x+width, y2, width, color=colors[2])
ax3.set_xticks(x+width)
ax3.set_xticklabels(['a', 'b', 'c', 'd', 'e'])

# circles with colors from default color cycle
for color in colors:
    ax4.add_patch(plt.Circle(np.random.randn(2), radius=0.3, color=color))
ax4.axis('equal')

plt.tight_layout()
plt.show()
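
Most figures also need axis labels, a title and a legend. A minimal sketch using the same object-oriented interface as above; the label text is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 101)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), '--', label='cos(x)')

ax.set_xlabel('x [radians]')
ax.set_ylabel('amplitude')
ax.set_title('Labeled lines')
ax.legend(loc='upper right')   # uses the label= given to each plot call

fig.tight_layout()
```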

## Exercise Solutions
### Exercise 1

In [None]:
x = np.zeros(100, dtype=np.dtype([('stringcol', 'S200'), ('boolcol', bool)]))

### Exercise 2

In [None]:
x = np.zeros((8,8))
x[1::2, ::2] = 1
x[::2, 1::2] = 1
print(x)

### Exercise 3

In [None]:
z = np.random.rand(10,5)
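x = z[np.arange(10),np.argmin(np.abs(z-0.66), axis=1)] #closest value in each row
y = z[np.argmin(np.abs(z-0.66), axis=0),np.arange(5)] #closest value in each column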

### Exercise 4

1. View
2. Copy

### Exercise 5

In [None]:
#slow
def maxn(x, n):
    return x[np.argsort(x)[-n:]]

#fast
def maxn(x, n):
    return x[np.argpartition(-1*x,n)[:n]]

maxn(np.random.randint(5,1000, 1000), 5)
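
The cell above returns the `n` largest entries of a 1-d array. For the exercise as stated (sorting a 2d array by its nth column), one possible sketch is below; the function name `sort_by_column` is hypothetical:

```python
import numpy as np

def sort_by_column(x, n):
    # argsort of column n gives the row order; indexing with it reorders whole rows
    return x[np.argsort(x[:, n])]

demo = np.array([[3, 9, 1],
                 [1, 5, 7],
                 [2, 0, 4]])
print(sort_by_column(demo, 0))
print(sort_by_column(demo, 1))
```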

### Exercise 6

In [None]:
a = np.zeros((6, 6), int)

for i, row in enumerate(a):
    if i == 0:
        row[:] = 1
    else:
        row[:] = np.cumsum(a[i-1])