# <img width=400 src="http://www.numpy.org/_static/numpy_logo.png" alt="Numpy"/>


## Why do we need numpy?

* You may have heard that "Python is slow". This is true when it comes to looping over many small Python objects.
* Python is dynamically typed and everything is an object, even an `int`. There are no primitive types.
* Numpy's main feature is the `ndarray` class, a fixed-length, homogeneously typed array class.
* Numpy implements a lot of functionality in fast C, Cython and Fortran code to work on these arrays.
* Python with vectorized operations using numpy can be blazingly fast.

See: [Python is not C](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en)
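To make the difference between a list of Python objects and an `ndarray` more concrete, here is a small sketch that compares the memory footprint of a single Python `int` with the per-element size of an array. It assumes CPython (the exact number reported by `sys.getsizeof` is implementation-specific), and the variable names are only illustrative.

```python
import sys

import numpy as np

# a list holds references to 1000 separate Python int objects
values = list(range(1000))

# the array holds 1000 raw 64-bit integers in one contiguous buffer
arr = np.arange(1000, dtype=np.int64)

# size of a single Python int object in bytes (CPython-specific)
print(sys.getsizeof(values[-1]))

# bytes per array element and size of the whole data buffer
print(arr.itemsize)
print(arr.nbytes)
```

Because every element has the same type and size, numpy can loop over the buffer in compiled code without inspecting each object.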

In [1]:
import numpy as np

## Small example timings

In [2]:
import math


def var(data):
    '''
    Knuth's algorithm for one-pass calculation of the variance.
    Avoids the rounding errors for large numbers that occur with the naive
    approach of `sum(v**2 for v in data) / n - (sum(data) / n)**2`.
    '''
    
    n = 0
    mean = 0.0
    m2 = 0.0
    
    if len(data) < 2:
        return float('nan')

    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        delta2 = value - mean
        m2 += delta * delta2

    return m2 / n 

In [80]:
%%timeit

l = list(range(1000))
var(l)

237 µs ± 599 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [79]:
%%timeit

a = np.arange(1000)  # array with numbers 0,...,999

np.var(a)

29 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
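Speed is only useful if the result is right, so the following sketch checks a one-pass variance against `np.var` and shows why the naive formula is avoided. The helper names `one_pass_var` and `naive_var` and the test data are assumptions made for this illustration.

```python
import numpy as np


def one_pass_var(data):
    """One-pass (population) variance, same update rule as `var` above."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
    return m2 / n


def naive_var(data):
    """Mean of the squares minus the square of the mean."""
    n = len(data)
    return sum(v**2 for v in data) / n - (sum(data) / n)**2


# small spread on top of a large offset: the true variance is 22.5
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]

print(one_pass_var(data))  # agrees with np.var
print(np.var(data))
print(naive_var(data))     # suffers from catastrophic cancellation
```

The squares are of order $10^{18}$, where neighbouring `float64` values are more than 100 apart, so the naive difference cannot resolve a variance of 22.5.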


## Basic math: vectorized

Operations on numpy arrays work vectorized, element-by-element

**Lose your loops**

In [5]:
# create a numpy array from a python list
a = np.array([1.0, 3.5, 7.1, 4, 6])

In [6]:
2 * a

array([  2. ,   7. ,  14.2,   8. ,  12. ])

In [7]:
a**2

array([  1.  ,  12.25,  50.41,  16.  ,  36.  ])

In [8]:
a**a

array([  1.00000000e+00,   8.02117802e+01,   1.10645633e+06,
         2.56000000e+02,   4.66560000e+04])

In [9]:
np.cos(a)

array([ 0.54030231, -0.93645669,  0.68454667, -0.65364362,  0.96017029])

**Attention: You need the `cos` from numpy!**

In [10]:
math.cos(a)

TypeError: only length-1 arrays can be converted to Python scalars

Most normal python functions with basic operators like `*`, `+`, `**` simply work because
of operator overloading:

In [11]:
def poly(x):
    return x + 2 * x**2 - x**3

poly(a)

array([   2.   ,  -14.875, -249.991,  -28.   , -138.   ])

In [12]:
poly(np.pi)

-8.125475224531307

## Useful properties

In [13]:
len(a)

5

In [14]:
a.shape

(5,)

In [15]:
a.dtype

dtype('float64')

In [16]:
a.ndim

1

## Arbitrary dimension arrays

In [17]:
# two-dimensional array
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

y + y

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [18]:
# since Python 3.5, @ is the matrix product
y @ y

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])

In [19]:
# Broadcasting, changing array dimensions to fit the larger one

y + np.array([1, 2, 3])

array([[ 2,  4,  6],
       [ 5,  7,  9],
       [ 8, 10, 12]])
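Broadcasting compares shapes from the last axis backwards and stretches axes of length 1. The sketch below spells this out for a row and a column vector; the names `row`, `col` and `outer` are chosen only for this example.

```python
import numpy as np

y = np.arange(1, 10).reshape(3, 3)

row = np.array([1, 2, 3])     # shape (3,): added to every row of y
col = row[:, np.newaxis]      # shape (3, 1): added to every column of y

print(y + row)
print(y + col)

# (3, 1) combined with (3,) broadcasts to (3, 3): an outer product
outer = col * row
print(outer)
```

Shapes that cannot be aligned this way, for example `(3, 3)` and `(2,)`, raise a `ValueError` instead of being silently reshaped.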

## Helpers for creating arrays

In [20]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [21]:
np.ones((5, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [22]:
np.full(5, np.nan)

array([ nan,  nan,  nan,  nan,  nan])

In [23]:
np.empty(5)  # attention: uninitialised memory, be careful

array([ 0.,  0.,  0.,  0.,  0.])

In [24]:
np.linspace(0, 1, 11)

array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

In [25]:
# like range() for arrays:
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [26]:
np.logspace(-4, 5, 10)

array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
         1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
         1.00000000e+05])

## Numpy Indexing

* Element access
* Slicing

In [27]:
x = np.arange(0, 10)

# like lists:
x[4]

4

In [28]:
# all elements with indices ≥1 and <4:
x[1:4]

array([1, 2, 3])

In [29]:
# negative indices count from the end
x[-1], x[-2]

(9, 8)

In [30]:
# combination:
x[3:-2]

array([3, 4, 5, 6, 7])

In [31]:
# step size
x[::2]

array([0, 2, 4, 6, 8])

In [32]:
# trick for reversal: negative step
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [33]:
y = np.array([x, x + 10, x + 20, x + 30])
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [34]:
# comma between indices
y[3, 2:-1]

array([32, 33, 34, 35, 36, 37, 38])

In [35]:
# only one index ⇒ one-dimensional array
y[2]

array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [36]:
# other axis: (: alone means the whole axis)
y[:, 3]

array([ 3, 13, 23, 33])

In [37]:
# inspecting the number of elements per axis:
y.shape

(4, 10)

In [None]:
## Exercise

## Changing array content

In [38]:
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [39]:
y[:, 3] = 0
y

array([[ 0,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [10, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [20, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [30, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

Using slices on both sides

In [40]:
y[:,0] = x[3:7]
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])
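One consequence of assigning through slices is worth spelling out: a basic slice is a view on the original data, not a copy. The following sketch illustrates this with small throwaway arrays; `view` and `copy` are names chosen for this example.

```python
import numpy as np

x = np.arange(10)

# a basic slice shares memory with x
view = x[2:5]
view[:] = 0          # this also changes x

# an explicit copy is independent of x
copy = x[5:8].copy()
copy[:] = -1         # x stays untouched

print(x)
print(np.shares_memory(x, view), np.shares_memory(x, copy))
```

If a slice should be modified without touching the original array, call `.copy()` first.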

Transposing inverts the order of the dimensions

In [41]:
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

In [42]:
y.shape

(4, 10)

In [43]:
y.T

array([[ 3,  4,  5,  6],
       [ 1, 11, 21, 31],
       [ 2, 12, 22, 32],
       [ 0,  0,  0,  0],
       [ 4, 14, 24, 34],
       [ 5, 15, 25, 35],
       [ 6, 16, 26, 36],
       [ 7, 17, 27, 37],
       [ 8, 18, 28, 38],
       [ 9, 19, 29, 39]])

In [44]:
y.T.shape

(10, 4)

## Masks

* A boolean array can be used to select only the elements where it contains `True`.
* Very powerful tool to select the elements that fulfill a certain condition

In [45]:
a = np.linspace(0, 2, 11)
b = np.random.normal(0, 1, 11)

print(b >= 0)
print(a[b >= 0])

[ True False  True  True False False False False False False  True]
[ 0.   0.4  0.6  2. ]


In [46]:
a[b < 0] = 0
a

array([ 0. ,  0. ,  0.4,  0.6,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  2. ])
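Masks can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). The sketch below uses a deterministic array so the result is reproducible; the names `inside` and `clipped` are illustrative. Note the parentheses, which are needed because `&` binds more tightly than the comparisons.

```python
import numpy as np

a = np.linspace(0, 2, 11)

# elements strictly between 0.5 and 1.5
inside = (a > 0.5) & (a < 1.5)

print(a[inside])
print(a[~inside])

# np.where builds a new array instead of modifying a in place
clipped = np.where(inside, a, 0.0)
print(clipped)
```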

## Reduction operations

Numpy has many operations that reduce the dimensionality of arrays

In [47]:
x = np.random.normal(0, 1, 10)

In [48]:
np.sum(x)

2.639281131161515

In [49]:
np.prod(x)

-0.0075133961408492787

In [50]:
np.mean(x)

0.2639281131161515

Standard Deviation

In [51]:
np.std(x)

1.1526527922074945

Standard error of the mean

In [52]:
np.std(x, ddof=1) / np.sqrt(len(x))

0.38421759740249811

Sample Standard Deviation

In [53]:
np.std(x, ddof=1)

1.2150027249094881
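The `ddof` argument ("delta degrees of freedom") only changes the divisor from $n$ to $n - \mathrm{ddof}$. A minimal sketch that reproduces both variants by hand, using a seeded generator so it can be rerun (the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10)
n = len(x)

# sum of squared deviations from the sample mean
ssd = np.sum((x - np.mean(x))**2)

population_std = np.sqrt(ssd / n)    # same as np.std(x)
sample_std = np.sqrt(ssd / (n - 1))  # same as np.std(x, ddof=1)
sem = sample_std / np.sqrt(n)        # standard error of the mean

print(population_std, np.std(x))
print(sample_std, np.std(x, ddof=1))
print(sem)
```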

Difference between neighbor elements

In [54]:
z = np.arange(10)**2
np.diff(z)

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17])

### Reductions on multi-dimensional arrays


In [55]:
array2d = np.arange(20).reshape(4, 5)

array2d

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [56]:
np.sum(array2d, axis=0)

array([30, 34, 38, 42, 46])

In [57]:
np.mean(array2d, axis=1)

array([  2.,   7.,  12.,  17.])
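The `axis` argument names the axis that is summed away, so its length disappears from the shape. The sketch below also uses `keepdims=True`, which keeps the reduced axis with length 1 so the result can be broadcast against the original array. The variable names are chosen for this example.

```python
import numpy as np

array2d = np.arange(20).reshape(4, 5)

# axis=0 collapses the 4 rows: one sum per column
col_sums = np.sum(array2d, axis=0)
print(col_sums.shape)

# axis=1 collapses the 5 columns; keepdims keeps a length-1 axis
row_means = np.mean(array2d, axis=1, keepdims=True)
print(row_means.shape)

# subtract each row's mean from that row via broadcasting
centered = array2d - row_means
print(centered)
```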

### Random numbers

* numpy has a large number of distributions built in

In [58]:
np.random.uniform(0, 1, 5)

array([ 0.49106545,  0.34141079,  0.74505641,  0.78889551,  0.1581187 ])

In [59]:
np.random.normal(5, 10, 5)

array([ 13.92607788,  21.42316881,  -6.63054261, -12.26313773,  -8.90668759])
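For reproducible results, random numbers can be drawn from a seeded generator. This sketch uses `np.random.default_rng`, which is available in NumPy 1.17 and later; the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

u = rng.uniform(0, 1, 5)   # uniform in [0, 1)
g = rng.normal(5, 10, 5)   # gaussian with mean 5 and standard deviation 10
k = rng.poisson(20, 5)     # poisson with expectation value 20

# a second generator with the same seed produces the same sequence
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(u, rng_again.uniform(0, 1, 5)))
```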

## Calculating pi through monte-carlo simulation

* We draw random numbers in a square with length of the sides of 2
* We count the points which are inside the circle of radius 1

The area of the square is

$$
A_\mathrm{square} = a^2 = 4
$$

The area of the circle is
$$
A_\mathrm{circle} = \pi r^2 = \pi
$$

With 
$$
\frac{n_\mathrm{circle}}{n_\mathrm{square}} = \frac{A_\mathrm{circle}}{A_\mathrm{square}}
$$
We can calculate pi:

$$
\pi = 4 \frac{n_\mathrm{circle}}{n_\mathrm{square}}
$$

In [60]:
n_square = 1000

x = np.random.uniform(-1, 1, n_square)
y = np.random.uniform(-1, 1, n_square)

radius = np.sqrt(x**2 + y**2)

n_circle = np.sum(radius <= 1.0)

print(4 * n_circle / n_square)

3.136
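The estimate comes with a statistical uncertainty. Since each point is either inside or outside the circle, $n_\mathrm{circle}$ follows a binomial distribution, and the standard error of the fraction $p = n_\mathrm{circle} / n_\mathrm{square}$ is $\sqrt{p (1 - p) / n_\mathrm{square}}$. The sketch below wraps the simulation in a hypothetical helper `estimate_pi` and shows how the uncertainty shrinks with more points.

```python
import numpy as np


def estimate_pi(n_square, rng):
    """Return a Monte Carlo estimate of pi and its statistical uncertainty."""
    x = rng.uniform(-1, 1, n_square)
    y = rng.uniform(-1, 1, n_square)

    # fraction of points that fall inside the unit circle
    p = np.mean(x**2 + y**2 <= 1.0)

    # binomial standard error of p, scaled by the factor 4
    return 4 * p, 4 * np.sqrt(p * (1 - p) / n_square)


rng = np.random.default_rng(0)  # the seed is an arbitrary choice

for n_square in (100, 10_000, 1_000_000):
    estimate, uncertainty = estimate_pi(n_square, rng)
    print('n = {:>7d}: {:.4f} ± {:.4f}'.format(n_square, estimate, uncertainty))
```

The uncertainty falls like $1 / \sqrt{n_\mathrm{square}}$, so each additional digit of precision costs a factor of 100 in the number of points.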


## Exercise 1

1. Draw 10000 gaussian random numbers with mean of $\mu = 2$ and standard deviation of $\sigma = 3$
2. Calculate the mean and the standard deviation of the sample
3. What percentage of the numbers are outside of $[\mu - \sigma, \mu + \sigma]$?
4. How many of the numbers are $> 0$?
5. Calculate the mean and the standard deviation of all numbers ${} > 0$

In [61]:
numbers = np.random.normal(2, 3, 10000)

In [62]:
np.mean(numbers), np.std(numbers)

(1.9483486120052589, 3.0557199214513675)

In [63]:
mask = np.logical_or(numbers <= -1, numbers >= 5)

len(numbers[mask]) / len(numbers)

0.3291

In [64]:
mask = numbers > 0

len(numbers[mask]), np.std(numbers[mask])

(7356, 2.2064388543070308)

## Exercise 2

Monte-Carlo uncertainty propagation

* The Hubble constant as measured by Planck is
$$
H_0 = (67.74 \pm 0.47)\,\frac{\mathrm{km}}{\mathrm{s}\cdot\mathrm{Mpc}}
$$

* Estimate the mean and the uncertainty of the velocity of a galaxy which is measured to be $(500 \pm 100)\,\mathrm{Mpc}$ away
using monte carlo methods
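As a cross-check for the Monte Carlo result, the uncertainty of a product $v = H_0 \cdot d$ of two independent quantities can be approximated to first order by adding the relative uncertainties in quadrature:

$$
\frac{\sigma_v}{v} \approx \sqrt{\left(\frac{\sigma_{H_0}}{H_0}\right)^2 + \left(\frac{\sigma_d}{d}\right)^2}
$$

The sketch below compares this approximation with a seeded simulation. It assumes gaussian, uncorrelated uncertainties; the seed and variable names are arbitrary choices.

```python
import numpy as np

h0, h0_unc = 67.74, 0.47   # km / (s Mpc), value quoted above
d, d_unc = 500.0, 100.0    # Mpc

# first-order (linear) error propagation for a product
v = h0 * d
v_unc = v * np.sqrt((h0_unc / h0)**2 + (d_unc / d)**2)

# Monte Carlo propagation with a seeded generator
rng = np.random.default_rng(0)
n = 100_000
samples = rng.normal(h0, h0_unc, n) * rng.normal(d, d_unc, n)

print('linear:      ({:.0f} ± {:.0f}) km/s'.format(v, v_unc))
print('monte carlo: ({:.0f} ± {:.0f}) km/s'.format(samples.mean(), samples.std()))
```

The two agree well here because the relation is a simple product; the Monte Carlo approach also works when no closed formula is at hand.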

In [65]:
n = 100000
h0 = np.random.normal(67.74, 0.47, n)
distance = np.random.normal(500, 100, n)

velocity = np.mean(h0 * distance)
velocity_unc = np.std(h0 * distance)

print('({:.0f} ± {:.0f}) km/s'.format(velocity, velocity_unc))


## Simple I/O functions

In [66]:
idx = np.arange(100)
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
n = np.random.poisson(20, 100)

In [67]:
idx.shape, x.shape, y.shape, n.shape

((100,), (100,), (100,), (100,))

In [68]:
np.savetxt(
    'data.csv',
    np.column_stack([idx, x, y, n]),
)

In [69]:
!head data.csv

0.000000000000000000e+00 6.468089138136101646e-01 -4.912455165809954671e-01 2.200000000000000000e+01
1.000000000000000000e+00 6.099879755400224035e-03 1.304266646175405997e+00 1.300000000000000000e+01
2.000000000000000000e+00 -1.876245406315537878e-01 -6.674541418890934663e-01 1.900000000000000000e+01
3.000000000000000000e+00 -1.079906582000154103e+00 -1.152137985739347004e+00 1.900000000000000000e+01
4.000000000000000000e+00 1.749771264466803533e+00 -1.721445279692863872e+00 1.800000000000000000e+01
5.000000000000000000e+00 1.666057673201401634e+00 4.798473261677179136e-01 2.700000000000000000e+01
6.000000000000000000e+00 3.041161405423926101e-01 -7.633089525778424811e-02 1.800000000000000000e+01
7.000000000000000000e+00 -1.226854676743246886e+00 -7.641236601979406462e-01 1.800000000000000000e+01
8.000000000000000000e+00 5.338553873945026673e-01 -1.090471451818261439e+00 1.900000000000000000e+01
9.000000000000000000e+00 -9.466091353217070958e-01 -3.679148987165913876e-01 2.50

In [70]:
# Load back the data, unpack=True is needed to read the data columnwise and not row-wise
idx, x, y, n = np.genfromtxt('data.csv', unpack=True)

idx.dtype, x.dtype

(dtype('float64'), dtype('float64'))

### Problems

* Everything is a float
* Much larger file than necessary because far too many digits are written for each float
* No column names
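These problems can be reproduced without touching the disk by writing to an in-memory text buffer. The sketch below is only an illustration with a handful of made-up rows; it uses `np.loadtxt` for reading back and `io.StringIO` in place of a file.

```python
import io

import numpy as np

idx = np.arange(5)
n = np.array([22, 13, 19, 19, 18])

# write two integer columns with the default format
buffer = io.StringIO()
np.savetxt(buffer, np.column_stack([idx, n]))
text = buffer.getvalue()

# every integer was written as a float with 18 decimal places
print(text.splitlines()[0])

# reading back yields float64, the integer type is lost
buffer.seek(0)
idx_back, n_back = np.loadtxt(buffer, unpack=True)
print(idx.dtype, idx_back.dtype)
```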

## Numpy recarrays

* Numpy recarrays can store columns of different types
* Rows are addressed by integer index
* Columns are addressed by strings

Solution for our io problem → Column names, different types

In [71]:
np.savetxt(
    'data.csv',
    np.column_stack([idx, x, y, n]),
    delimiter=',', # true csv file
    header=','.join(['idx', 'x', 'y', 'n']),
    fmt=['%d', '%.4g', '%.4g', '%d'],
)

In [72]:
!head data.csv

# idx,x,y,n
0,0.6468,-0.4912,22
1,0.0061,1.304,13
2,-0.1876,-0.6675,19
3,-1.08,-1.152,19
4,1.75,-1.721,18
5,1.666,0.4798,27
6,0.3041,-0.07633,18
7,-1.227,-0.7641,18
8,0.5339,-1.09,19


In [73]:
data = np.genfromtxt(
    'data.csv',
    names=True, # load column names from first row
    dtype=None, # Automagically determine the best data type for each column
    delimiter=',',
)

In [74]:
data[0]

(0,  0.6468, -0.4912, 22)

In [75]:
data['x']

array([ 0.6468 ,  0.0061 , -0.1876 , -1.08   ,  1.75   ,  1.666  ,
        0.3041 , -1.227  ,  0.5339 , -0.9466 , -0.7763 , -1.262  ,
        0.1387 ,  0.7034 ,  2.759  ,  1.095  , -0.8748 ,  0.5248 ,
       -1.293  , -0.1479 , -0.05448, -0.4313 ,  1.986  ,  1.007  ,
        0.5639 ,  0.8668 ,  0.3396 ,  1.395  , -0.5083 ,  0.2222 ,
       -0.8083 ,  0.1556 , -1.186  , -1.027  ,  0.2496 , -1.874  ,
        0.9063 , -0.953  , -2.449  ,  1.062  , -1.77   , -1.358  ,
       -0.5824 ,  1.761  , -1.061  , -0.2251 ,  2.008  , -1.423  ,
       -0.4405 , -2.013  ,  1.867  ,  0.2681 , -0.4608 ,  2.733  ,
        0.6063 , -0.6289 ,  0.8423 ,  0.6459 , -0.1953 , -0.05911,
       -0.5392 , -1.512  , -1.275  ,  0.6998 , -0.3023 , -0.8536 ,
       -0.4017 ,  1.052  ,  0.3916 ,  0.2685 , -0.7413 ,  1.015  ,
       -0.5434 ,  1.872  , -0.4297 ,  0.3171 , -1.235  , -2.063  ,
        1.571  , -0.1108 , -0.4851 , -1.421  ,  1.129  , -1.198  ,
       -2.283  , -0.06186, -0.5891 ,  0.294  , -0.5567 ,  0.22

In [76]:
data.dtype

dtype([('idx', '<i8'), ('x', '<f8'), ('y', '<f8'), ('n', '<i8')])
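The same kind of array can also be built directly in memory by passing a list of `(name, type)` pairs as `dtype`. A minimal sketch with three made-up rows, where `table` and the field selection are chosen only for illustration:

```python
import numpy as np

# one (name, type) pair per column
dtype = np.dtype([('idx', 'i8'), ('x', 'f8'), ('n', 'i8')])

table = np.zeros(3, dtype=dtype)
table['idx'] = [0, 1, 2]
table['x'] = [0.6468, 0.0061, -0.1876]
table['n'] = [22, 13, 19]

# rows by integer index, columns by name
row = table[1]
print(row)
print(table['x'])

# masks work on columns and select whole rows
selected = table[table['n'] > 15]
print(selected['idx'])
```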

## Linear algebra

Numpy offers a lot of linear algebra functionality, mostly wrapping LAPACK

In [94]:
# symmetric matrix, so we can use eigh
mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

eig_vals, eig_vecs

(array([-1.40512484,  4.        ,  6.40512484]),
 array([[  3.07818468e-01,   8.32050294e-01,  -4.61454330e-01],
        [ -8.31898624e-01,  -1.93604245e-16,  -5.54927635e-01],
        [ -4.61727702e-01,   5.54700196e-01,   6.92181495e-01]]))

In [97]:
np.linalg.inv(mat)

array([[ 0.13888889,  0.22222222,  0.16666667],
       [ 0.22222222, -0.44444444, -0.33333333],
       [ 0.16666667, -0.33333333,  0.        ]])
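Results like these are easy to verify with `np.allclose`. The sketch below checks the defining relations for the eigen decomposition and the inverse, and uses `np.linalg.solve` for a linear system, which avoids forming the inverse explicitly. The right-hand side `b` is an arbitrary example vector.

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])

# eigenvectors are the columns of eig_vecs: mat @ v = lambda * v
eig_vals, eig_vecs = np.linalg.eigh(mat)
print(np.allclose(mat @ eig_vecs, eig_vecs * eig_vals))

# the inverse times the matrix gives the identity
inverse = np.linalg.inv(mat)
print(np.allclose(mat @ inverse, np.eye(3)))

# solve mat @ x = b directly
b = np.array([1, 2, 3])
x = np.linalg.solve(mat, b)
print(np.allclose(mat @ x, b))
```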

## Numpy matrices

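Numpy also has a matrix class, with operator overloading suited for matrices: `*` is the matrix product. The numpy documentation no longer recommends this class for new code; regular arrays with `@` cover the same use cases.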

In [99]:
mat = np.matrix(mat)

In [100]:
mat.T

matrix([[ 4,  2,  0],
        [ 2,  1, -3],
        [ 0, -3,  4]])

In [101]:
mat * mat

matrix([[ 20,  10,  -6],
        [ 10,  14, -15],
        [ -6, -15,  25]])

In [102]:
mat * 5

matrix([[ 20,  10,   0],
        [ 10,   5, -15],
        [  0, -15,  20]])

In [103]:
mat.I

matrix([[ 0.13888889,  0.22222222,  0.16666667],
        [ 0.22222222, -0.44444444, -0.33333333],
        [ 0.16666667, -0.33333333,  0.        ]])

In [105]:
mat * np.matrix([1, 2, 3]).T

matrix([[ 8],
        [-5],
        [ 6]])
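The same operations can be written with plain arrays. This sketch mirrors the matrix examples above using `@`, `.T` and `np.linalg.inv`; the names `arr` and `vec` are chosen for this example.

```python
import numpy as np

arr = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])
vec = np.array([1, 2, 3])

print(arr @ arr)           # matrix product, like mat * mat
print(arr * arr)           # element-wise product for plain arrays
print(arr @ vec)           # matrix times vector, result has shape (3,)
print(np.linalg.inv(arr))  # like mat.I
```

With arrays, `*` is always element-wise, so the meaning of an expression does not depend on whether an operand happens to be a matrix.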