# <img width=400 src="http://www.numpy.org/_static/numpy_logo.png" alt="Numpy"/>


## Why do we need numpy?

* You may have heard that "Python is slow". This is true when it comes to looping over many small Python objects.
* Python is dynamically typed and everything is an object, even an `int`. There are no primitive types.
* Numpy's main feature is the `ndarray` class, a fixed-length, homogeneously typed array class.
* Numpy implements a lot of functionality in fast C, Cython and Fortran code that works on these arrays.
* Python with vectorized operations using numpy can be blazingly fast.

See: [Python is not C](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en)
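
As a rough illustration of the points above, the following sketch contrasts a Python list, whose elements are full Python objects, with an `ndarray`, which stores raw values of a single type. The variable names (`py_list`, `arr`, `mixed`) are chosen here for illustration only.

```python
import sys

import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list, dtype=np.int64)

# every list element is a full Python object with its own type and size
print(type(py_list[0]), sys.getsizeof(py_list[0]))

# the array stores plain 8-byte integers of one single type
print(arr.dtype, arr.itemsize, arr.nbytes)

# mixing types in one array converts everything to a common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

Because all elements share one type and a fixed size, numpy can loop over them in compiled code instead of inspecting each object individually.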

In [None]:
import numpy as np

## Small example timings

In [None]:
import math


def var(data):
    '''
    Welford's algorithm (as presented by Knuth) for one-pass
    calculation of the variance.
    Avoids the rounding errors that large numbers cause in the naive
    approach of `sum(v**2 for v in data) / n - (sum(data) / n)**2`
    '''
    
    n = 0
    mean = 0.0
    m2 = 0.0
    
    if len(data) < 2:
        return float('nan')

    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        delta2 = value - mean
        m2 += delta * delta2

    return m2 / n 

In [3]:
%%timeit

l = list(range(1000))
var(l)

242 µs ± 2.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
%%timeit

a = np.arange(1000)  # array with numbers 0,...,999

np.var(a)

29.8 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
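
Speed is not the only concern: the docstring of `var` mentions rounding errors in the naive formula. The sketch below, with the hypothetical helper names `var_naive` and `var_one_pass`, compares both approaches on values with a small spread on top of a large offset. The data set is made up for this illustration.

```python
import numpy as np


def var_naive(data):
    # naive formula: mean of the squares minus square of the mean
    n = len(data)
    return sum(v**2 for v in data) / n - (sum(data) / n)**2


def var_one_pass(data):
    # same one-pass update as in `var` above
    n = 0
    mean = 0.0
    m2 = 0.0
    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
    return m2 / n


# deviations from the mean are -6, -3, 3, 6, so the exact variance is 22.5
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

print(var_naive(data))     # suffers from cancellation of two huge numbers
print(var_one_pass(data))
print(np.var(data))
```

The naive version subtracts two numbers of order $10^{18}$, so the small difference between them is lost in floating point rounding.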


## Basic math: vectorized

Operations on numpy arrays are vectorized: they work element by element.

**Lose your loops**

In [5]:
# create a numpy array from a python list
a = np.array([1.0, 3.5, 7.1, 4, 6])

In [6]:
2 * a

array([  2. ,   7. ,  14.2,   8. ,  12. ])

In [7]:
a**2

array([  1.  ,  12.25,  50.41,  16.  ,  36.  ])

In [8]:
a**a

array([  1.00000000e+00,   8.02117802e+01,   1.10645633e+06,
         2.56000000e+02,   4.66560000e+04])

In [9]:
np.cos(a)

array([ 0.54030231, -0.93645669,  0.68454667, -0.65364362,  0.96017029])

**Attention: You need the `cos` from numpy!**

In [10]:
math.cos(a)

TypeError: only length-1 arrays can be converted to Python scalars

Most normal Python functions that only use basic operators like `*`, `+`, `**` simply work because
of operator overloading:

In [11]:
def poly(x):
    return x + 2 * x**2 - x**3

poly(a)

array([   2.   ,  -14.875, -249.991,  -28.   , -138.   ])

In [12]:
poly(np.pi)

-8.125475224531307

## Useful properties

In [13]:
len(a)

5

In [14]:
a.shape

(5,)

In [15]:
a.dtype

dtype('float64')

In [16]:
a.ndim

1

## Arbitrary dimension arrays

In [17]:
# two-dimensional array
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

y + y

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [18]:
# since python 3.5, @ is the matrix product
y @ y

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])

In [19]:
# Broadcasting: the smaller array is stretched to fit the shape of the larger one

y + np.array([1, 2, 3])

array([[ 2,  4,  6],
       [ 5,  7,  9],
       [ 8, 10, 12]])
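
A short sketch of how broadcasting treats different shapes. The arrays `row` and `col` are names introduced here for illustration.

```python
import numpy as np

y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row = np.array([1, 2, 3])            # shape (3,)
col = np.array([[10], [20], [30]])   # shape (3, 1)

# shape (3,) is treated as (1, 3) and repeated along the first axis
print(y + row)

# shape (3, 1) is repeated along the second axis
print(y + col)

# (3, 1) and (3,) broadcast to (3, 3)
print((col + row).shape)
```

Shapes are compared from the last axis backwards. Two axes are compatible if they have the same length or if one of them has length 1.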

## Reduction operations

Numpy has many operations that reduce the dimensionality of arrays

In [20]:
x = np.random.normal(0, 1, 10)

In [21]:
np.sum(x)

-0.25860161547831051

In [22]:
np.prod(x)

-4.4678976520458003e-05

In [23]:
np.mean(x)

-0.025860161547831051

Standard Deviation

In [24]:
np.std(x)

0.772263139600875

Standard error of the mean

In [25]:
np.std(x, ddof=1) / np.sqrt(len(x))

0.25742104653362496

Sample Standard Deviation

In [26]:
np.std(x, ddof=1)

0.81403682471044714

Difference between neighbor elements

In [27]:
z = np.arange(10)**2
np.diff(z)

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17])

### Reductions on multi-dimensional arrays


In [28]:
array2d = np.arange(20).reshape(4, 5)

array2d

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [29]:
np.sum(array2d, axis=0)

array([30, 34, 38, 42, 46])

In [30]:
np.mean(array2d, axis=1)

array([  2.,   7.,  12.,  17.])
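
The `axis` argument names the axis that is collapsed by the reduction. The following sketch makes this visible through the result shapes and also shows the `keepdims` option; the names `col_sums`, `row_means` and `centered` are illustrative.

```python
import numpy as np

array2d = np.arange(20).reshape(4, 5)

# axis=0 collapses the rows: one value per column
col_sums = np.sum(array2d, axis=0)
print(col_sums.shape)

# axis=1 collapses the columns: one value per row
row_means = np.mean(array2d, axis=1)
print(row_means.shape)

# keepdims=True keeps the reduced axis with length 1,
# so the result broadcasts against the original array
centered = array2d - np.mean(array2d, axis=1, keepdims=True)
print(centered.sum(axis=1))
```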

## Exercise 1

Write a function that calculates the analytical linear regression for a set of
x and y values.

Reminder:

$$ f(x) = a \cdot x + b$$

with 

$$
\hat{a} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \\
\hat{b} = \bar{y} - \hat{a} \cdot \bar{x}
$$

In [31]:
# %load 04_01_numpy_solutions/exercise_linear.py

In [None]:
x = np.linspace(0, 1, 50)
y = 5 * np.random.normal(x, 0.1) + 2  # see section on random numbers later

print(linear_regression(x, y))
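
One possible way to cross-check your own function is `np.polyfit`, which performs a least-squares polynomial fit. This is not the exercise solution, only a reference result. The sketch uses noise-free data (an assumption made here so that the expected values are known).

```python
import numpy as np

# noise-free test data with known slope 5 and intercept 2
x = np.linspace(0, 1, 50)
y = 5 * x + 2

# a degree-1 fit returns the coefficients ordered as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```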

## Helpers for creating arrays

In [32]:
np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [33]:
np.ones((5, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [34]:
np.full(5, np.nan)

array([ nan,  nan,  nan,  nan,  nan])

In [35]:
np.empty(5)  # attention: uninitialised memory, be careful

array([ 0.,  0.,  0.,  0.,  0.])

In [36]:
np.linspace(0, 1, 11)

array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

In [37]:
# like range() for arrays:
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [38]:
np.logspace(-4, 5, 10)

array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
         1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
         1.00000000e+05])

## Numpy Indexing

* Element access
* Slicing

In [39]:
x = np.arange(0, 10)

# like lists:
x[4]

4

In [40]:
# all elements with indices ≥1 and <4:
x[1:4]

array([1, 2, 3])

In [41]:
# negative indices count from the end
x[-1], x[-2]

(9, 8)

In [42]:
# combination:
x[3:-2]

array([3, 4, 5, 6, 7])

In [43]:
# step size
x[::2]

array([0, 2, 4, 6, 8])

In [44]:
# trick for reversal: negative step
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [45]:
y = np.array([x, x + 10, x + 20, x + 30])
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [46]:
# comma between indices
y[3, 2:-1]

array([32, 33, 34, 35, 36, 37, 38])

In [47]:
# only one index ⇒ one-dimensional array
y[2]

array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [48]:
# other axis: (: alone means the whole axis)
y[:, 3]

array([ 3, 13, 23, 33])

In [49]:
# inspecting the number of elements per axis:
y.shape

(4, 10)

## Changing array content

In [50]:
y

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])

In [51]:
y[:, 3] = 0
y

array([[ 0,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [10, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [20, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [30, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

Using slices on both sides

In [52]:
y[:,0] = x[3:7]
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])
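
Assignment through a slice works because basic slices are views on the original data, not copies. A minimal sketch of the difference, with illustrative variable names:

```python
import numpy as np

a = np.arange(10)

# a basic slice is a view: it shares memory with a
view = a[2:5]
view[:] = 0
print(a)  # a has changed as well

b = np.arange(10)

# an explicit copy is independent of the original
independent = b[2:5].copy()
independent[:] = 0
print(b)  # b is unchanged
```

Use `.copy()` whenever a slice should be modified without touching the array it was taken from.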

Transposing inverts the order of the dimensions

In [53]:
y

array([[ 3,  1,  2,  0,  4,  5,  6,  7,  8,  9],
       [ 4, 11, 12,  0, 14, 15, 16, 17, 18, 19],
       [ 5, 21, 22,  0, 24, 25, 26, 27, 28, 29],
       [ 6, 31, 32,  0, 34, 35, 36, 37, 38, 39]])

In [54]:
y.shape

(4, 10)

In [55]:
y.T

array([[ 3,  4,  5,  6],
       [ 1, 11, 21, 31],
       [ 2, 12, 22, 32],
       [ 0,  0,  0,  0],
       [ 4, 14, 24, 34],
       [ 5, 15, 25, 35],
       [ 6, 16, 26, 36],
       [ 7, 17, 27, 37],
       [ 8, 18, 28, 38],
       [ 9, 19, 29, 39]])

In [56]:
y.T.shape

(10, 4)

## Masks

* A boolean array can be used to select only the elements where it contains `True`.
* This is a very powerful tool to select the elements that fulfill a certain condition.

In [57]:
a = np.linspace(0, 2, 11)
b = np.random.normal(0, 1, 11)

print(b >= 0)
print(a[b >= 0])

[ True  True  True  True  True  True False False False  True False]
[ 0.   0.2  0.4  0.6  0.8  1.   1.8]


In [58]:
a[b < 0] = 0
a

array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ,  0. ,  0. ,  0. ,  1.8,  0. ])
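
Masks can also be combined. The sketch below shows the element-wise logical operators and `np.where`; the names `mask` and `clipped` are illustrative.

```python
import numpy as np

a = np.linspace(0, 2, 11)

# combine conditions with & (and), | (or), ~ (not);
# the parentheses are required because of operator precedence
mask = (a > 0.5) & (a < 1.5)
print(a[mask])

# counting selected elements: True counts as 1
print(np.sum(mask))

# np.where chooses between two alternatives without modifying a
clipped = np.where(a > 1.0, 1.0, a)
print(clipped)
```

Python's `and` and `or` do not work element-wise on arrays, which is why `&` and `|` are used here.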

## Random numbers

* numpy has a large number of distributions built in

In [59]:
np.random.uniform(0, 1, 5)

array([ 0.50577223,  0.75369573,  0.60445567,  0.89630525,  0.24476256])

In [60]:
np.random.normal(5, 10, 5)

array([  4.29544427,  -1.0144828 ,  12.10904789,  -0.5179243 ,   1.60214551])

## Calculating pi through Monte Carlo simulation

* We draw random points in a square with a side length of 2
* We count the points that lie inside the circle of radius 1

The area of the square is

$$
A_\mathrm{square} = a^2 = 4
$$

The area of the circle is
$$
A_\mathrm{circle} = \pi r^2 = \pi
$$

With 
$$
\frac{n_\mathrm{circle}}{n_\mathrm{square}} = \frac{A_\mathrm{circle}}{A_\mathrm{square}}
$$
we can calculate pi:

$$
\pi = 4 \frac{n_\mathrm{circle}}{n_\mathrm{square}}
$$

In [61]:
n_square = 1000

x = np.random.uniform(-1, 1, n_square)
y = np.random.uniform(-1, 1, n_square)

radius = np.sqrt(x**2 + y**2)

n_circle = np.sum(radius <= 1.0)

print(4 * n_circle / n_square)

3.172
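
The estimate fluctuates from run to run, and its statistical error shrinks roughly like $1/\sqrt{n}$. The following sketch wraps the calculation in a hypothetical helper `estimate_pi` and uses a seeded generator from `np.random.default_rng` (available in newer numpy versions, 1.17 and later) so that the numbers are reproducible.

```python
import numpy as np

# seeded generator, so repeated runs give the same numbers
rng = np.random.default_rng(42)


def estimate_pi(n_square):
    # same method as above: fraction of points inside the unit circle
    x = rng.uniform(-1, 1, n_square)
    y = rng.uniform(-1, 1, n_square)
    n_circle = np.sum(x**2 + y**2 <= 1.0)
    return 4 * n_circle / n_square


for n in [100, 10_000, 1_000_000]:
    estimate = estimate_pi(n)
    print(n, estimate, abs(estimate - np.pi))
```

Comparing the squared radius with 1 avoids the square root, which does not change the result for a radius of 1.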


## Exercise

1. Draw 10000 gaussian random numbers with mean of $\mu = 2$ and standard deviation of $\sigma = 3$
2. Calculate the mean and the standard deviation of the sample
3. What percentage of the numbers are outside of $[\mu - \sigma, \mu + \sigma]$?
4. How many of the numbers are $> 0$?
5. Calculate the mean and the standard deviation of all numbers ${} > 0$

In [62]:
# %load 04_01_numpy_solutions/exercise_gaussian.py

## Exercise

Monte-Carlo uncertainty propagation

* The Hubble constant as measured by Planck is
$$
H_0 = (67.74 \pm 0.47)\,\frac{\mathrm{km}}{\mathrm{s}\cdot\mathrm{Mpc}}
$$

* Estimate the mean and the uncertainty of the velocity of a galaxy that is measured to be $(500 \pm 100)\,\mathrm{Mpc}$ away,
using Monte Carlo methods

In [63]:
# %load 04_01_numpy_solutions/exercise_hubble.py

## Simple I/O functions

In [64]:
idx = np.arange(100)
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
n = np.random.poisson(20, 100)

In [65]:
idx.shape, x.shape, y.shape, n.shape

((100,), (100,), (100,), (100,))

In [66]:
np.savetxt(
    'data.csv',
    np.column_stack([idx, x, y, n]),
)

In [67]:
!head data.csv

0.000000000000000000e+00 -3.707558559158785072e-01 -1.080613066476477035e+00 1.400000000000000000e+01
1.000000000000000000e+00 1.647229946054118779e-02 -1.094131938820141620e-01 1.700000000000000000e+01
2.000000000000000000e+00 5.077597251130024913e-01 1.494831031901187601e-01 2.100000000000000000e+01
3.000000000000000000e+00 9.036172454735694748e-01 1.165135747518170950e+00 9.000000000000000000e+00
4.000000000000000000e+00 2.104572849075781349e+00 9.885148341064914357e-01 1.600000000000000000e+01
5.000000000000000000e+00 -1.460584079597229135e-01 -1.583422152471663846e+00 1.800000000000000000e+01
6.000000000000000000e+00 -5.920765390655726712e-01 3.953281331954364708e-01 1.700000000000000000e+01
7.000000000000000000e+00 1.620063666618965503e-01 9.768933578166620890e-01 2.100000000000000000e+01
8.000000000000000000e+00 1.628460029151629074e-01 -9.161714278629565222e-02 1.100000000000000000e+01
9.000000000000000000e+00 1.560382086931429324e+00 8.074386274527169949e-01 1.8000000

In [68]:
# Load back the data, unpack=True is needed to read the data columnwise and not row-wise
idx, x, y, n = np.genfromtxt('data.csv', unpack=True)

idx.dtype, x.dtype

(dtype('float64'), dtype('float64'))

### Problems

* Everything is a float
* The file is far larger than necessary because too many digits are written for each float
* No column names

## Numpy recarrays

* Numpy recarrays can store columns of different types
* Rows are addressed by integer index
* Columns are addressed by strings

Solution for our I/O problem → Column names, different types

In [69]:
np.savetxt(  # savetxt returns None, so there is nothing to assign
    'data.csv',
    np.column_stack([idx, x, y, n]),
    delimiter=',', # true csv file
    header=','.join(['idx', 'x', 'y', 'n']),
    fmt=['%d', '%.4g', '%.4g', '%d'],
)

In [70]:
!head data.csv

# idx,x,y,n
0,-0.3708,-1.081,14
1,0.01647,-0.1094,17
2,0.5078,0.1495,21
3,0.9036,1.165,9
4,2.105,0.9885,16
5,-0.1461,-1.583,18
6,-0.5921,0.3953,17
7,0.162,0.9769,21
8,0.1628,-0.09162,11


In [71]:
data = np.genfromtxt(
    'data.csv',
    names=True,  # load column names from first row
    dtype=None,  # automatically determine the best data type for each column
    delimiter=',',
)

In [72]:
data[0]

(0, -0.3708, -1.081, 14)

In [73]:
data['x']

array([-0.3708 ,  0.01647,  0.5078 ,  0.9036 ,  2.105  , -0.1461 ,
       -0.5921 ,  0.162  ,  0.1628 ,  1.56   ,  0.8382 , -1.749  ,
       -0.6222 , -0.4225 , -0.4658 ,  0.2545 , -0.5461 ,  0.3204 ,
       -1.193  , -0.5417 , -0.1508 ,  0.1357 ,  0.06179, -0.5445 ,
       -0.2978 ,  0.3666 ,  1.14   , -0.138  ,  0.8756 ,  0.4264 ,
        0.4323 , -0.9988 , -1.659  , -0.1347 ,  0.4495 ,  1.314  ,
       -0.2972 , -1.91   ,  0.704  ,  0.587  , -0.4766 , -1.107  ,
        0.01694, -0.5696 , -2.634  , -0.8309 ,  1.852  , -0.71   ,
       -2.985  , -1.711  ,  0.5124 , -1.304  , -0.1348 , -1.607  ,
        0.0329 , -1.187  , -1.01   ,  0.2095 ,  1.123  , -1.66   ,
       -0.816  ,  0.8    ,  0.408  ,  0.406  ,  1.967  , -1.757  ,
        0.7764 , -0.036  ,  0.7693 , -0.1324 ,  1.219  ,  0.6129 ,
        0.2757 ,  1.392  , -0.9441 ,  1.237  , -0.882  ,  0.3506 ,
       -1.772  ,  1.064  , -0.2647 ,  1.81   , -0.1802 ,  0.9282 ,
       -0.09231, -0.7343 ,  0.677  , -0.06749,  0.7932 , -0.98

In [74]:
data.dtype

dtype([('idx', '<i8'), ('x', '<f8'), ('y', '<f8'), ('n', '<i8')])
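
Such an array with named fields can also be built by hand. The sketch below reuses the field names and types from the dtype shown above and the first three rows of the file; it only illustrates how rows, columns and masks interact.

```python
import numpy as np

# field names and types mirror the dtype shown above
dtype = [('idx', 'i8'), ('x', 'f8'), ('y', 'f8'), ('n', 'i8')]

data = np.array([
    (0, -0.3708, -1.081, 14),
    (1, 0.01647, -0.1094, 17),
    (2, 0.5078, 0.1495, 21),
], dtype=dtype)

print(data['n'])                   # one column, addressed by name
print(data[1])                     # one row, addressed by integer index
print(data[data['n'] > 15]['x'])   # mask on one column, read another
```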

## Linear algebra

Numpy offers a lot of linear algebra functionality, mostly wrapping LAPACK

In [75]:
# symmetric matrix, use eigh
mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

eig_vals, eig_vecs

(array([-1.40512484,  4.        ,  6.40512484]),
 array([[  3.07818468e-01,   8.32050294e-01,  -4.61454330e-01],
        [ -8.31898624e-01,  -1.93604245e-16,  -5.54927635e-01],
        [ -4.61727702e-01,   5.54700196e-01,   6.92181495e-01]]))

In [76]:
np.linalg.inv(mat)

array([[ 0.13888889,  0.22222222,  0.16666667],
       [ 0.22222222, -0.44444444, -0.33333333],
       [ 0.16666667, -0.33333333,  0.        ]])
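
A quick sanity check of these results, as a sketch: the eigenvectors should satisfy $M v = \lambda v$, and a linear system can be solved with `np.linalg.solve` without forming the inverse. The right-hand side `b` is made up for this example.

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])

# eigh is meant for symmetric (Hermitian) matrices
# and returns the eigenvalues in ascending order
eig_vals, eig_vecs = np.linalg.eigh(mat)

# each column of eig_vecs is one eigenvector: mat @ v == lambda * v
print(np.allclose(mat @ eig_vecs, eig_vecs * eig_vals))

# solve mat @ x = b directly instead of computing inv(mat) @ b
b = np.array([8, -5, 6])
x = np.linalg.solve(mat, b)
print(x)
```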

## Numpy matrices

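Numpy also has a matrix class, with operator overloading suited for matrices.
Note that the numpy documentation no longer recommends `np.matrix`; regular arrays together with the `@` operator cover the same use cases.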

In [77]:
mat = np.matrix(mat)

In [78]:
mat.T

matrix([[ 4,  2,  0],
        [ 2,  1, -3],
        [ 0, -3,  4]])

In [79]:
mat * mat

matrix([[ 20,  10,  -6],
        [ 10,  14, -15],
        [ -6, -15,  25]])

In [80]:
mat * 5

matrix([[ 20,  10,   0],
        [ 10,   5, -15],
        [  0, -15,  20]])

In [81]:
mat.I

matrix([[ 0.13888889,  0.22222222,  0.16666667],
        [ 0.22222222, -0.44444444, -0.33333333],
        [ 0.16666667, -0.33333333,  0.        ]])

In [82]:
mat * np.matrix([1, 2, 3]).T

matrix([[ 8],
        [-5],
        [ 6]])
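
For comparison, the same calculations can be written with a regular array and the `@` operator. This is a sketch of the equivalent calls, with `vec` as an illustrative name.

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])
vec = np.array([1, 2, 3])

print(mat @ mat)            # matrix product, like `mat * mat` for np.matrix
print(mat @ vec)            # matrix-vector product, result has shape (3,)
print(np.linalg.inv(mat))   # inverse, like `mat.I`
print(mat * mat)            # for arrays, * stays element-wise
```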