## The Python data stack

* Overview of the mainstream packages that are used in Python for data analysis
* A look at two core packages
    * NumPy for fast array manipulation
    * Plotting with matplotlib

![./figures/scipy_stack.jpg](./figures/scipy_stack.jpg)

## NumPy

* The common foundation for most other data analysis packages
* Provides the `array` datatype
* Interfaces with `C` and `FORTRAN` libraries for speed
* https://docs.scipy.org/doc/numpy/reference/

## Matplotlib

* Provides plotting functions
* The basis for other libraries; for example, plotting in `pandas` builds on it
* Not the only option: there are many alternatives
    * http://seaborn.pydata.org/examples/
    * https://plot.ly/python/range-slider/
    * http://bokeh.pydata.org/en/latest/docs/gallery/candlestick.html

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-2, 10)
y = np.exp(-x)

plt.plot(x, y)

In [None]:
plt.show()

In [None]:
## magic command that imports `numpy` as `np` and `matplotlib.pyplot` as `plt`
## and makes plots appear automatically below the cell

%pylab inline

In [None]:
x = np.arange(-2, 10)
y = np.exp(-x)

plt.plot(x, y);

## Speed comparison

* Calculate $1 + \frac{1}{2} + \frac{1}{4} + \dots$

![./figures/geom.png](./figures/geom.png)
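
For reference, the partial sum has a closed form. This is the standard geometric-series identity, stated here as a check value rather than taken from the figure; both computations below should return a value very close to 2.

$$\sum_{i=0}^{n-1} \frac{1}{2^i} = \frac{1 - (1/2)^n}{1 - 1/2} = 2 - \frac{1}{2^{n-1}} \longrightarrow 2 \quad (n \to \infty)$$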

In [None]:
n = 10000

def loop(n):
    # add the first n terms of the series one at a time
    s = 0.0
    for i in range(n):
        s = s + 1 / (2 ** i)

    return s

loop(n) # use %time to find out exactly how long it took

In [None]:
np.arange(0, n)

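In [None]:
## use a float base: integer powers wrap around from 2 ** 63 onwards (int64)
## the largest values overflow to inf and NumPy prints a warning
2.0 ** np.arange(0, n)

In [None]:
## 0.5 ** i equals 1 / 2 ** i and quietly underflows to 0 for large i
0.5 ** np.arange(0, n)

In [None]:
%time (0.5 ** np.arange(0, n)).sum()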
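
Outside of IPython, where `%time` is not available, the same comparison can be sketched with the standard-library `timeit` module. The helper name `vectorized` and the choice of 10 repetitions are assumptions made for this illustration; the absolute timings depend on the machine.

```python
import timeit
import numpy as np

n = 10000

def loop(n):
    # plain Python: one addition per iteration
    s = 0.0
    for i in range(n):
        s = s + 1 / (2 ** i)
    return s

def vectorized(n):
    # 0.5 ** i is used instead of 1 / 2 ** i so that the
    # integer array cannot overflow for large exponents
    return (0.5 ** np.arange(n)).sum()

# Run each version 10 times and report the average time per call.
t_loop = timeit.timeit(lambda: loop(n), number=10) / 10
t_vec = timeit.timeit(lambda: vectorized(n), number=10) / 10

print("loop: {:.6f}s  numpy: {:.6f}s".format(t_loop, t_vec))
```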

`np.array` | `list`
--- | ---
single datatype per array | multiple datatypes per list
limited set of supported datatypes | any Python object
convenient for mathematical functions | convenient for general programming
used with vectorized functions | used with loops
fast for numerical work | slow for numerical work
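
A small sketch of the first rows of the table. The variable names are chosen for this illustration only.

```python
import numpy as np

values = [1, 2, 3]

# A list can mix types; "multiplying" it repeats the list.
mixed = [1, "two", 3.0]
repeated = values * 2            # [1, 2, 3, 1, 2, 3]

# An array has one dtype; arithmetic is applied to every element.
arr = np.array(values)
doubled = arr * 2                # elements 2, 4, 6

# Mixing ints and floats upcasts the whole array to float.
upcast = np.array([1, 2.5, 3])

print(repeated, doubled, upcast.dtype)
```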

## `np.array` creation

* From a list

`arr = np.array([2, 3, 4])`

* With functions such as `np.ones` and `np.arange`, or the functions in the `np.random` module

`arr = np.ones(3)`
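
Other creation functions follow the same pattern. A minimal sketch; `np.zeros` and `np.linspace` are not used elsewhere in this section and appear here only as further examples.

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2 rows, 3 columns, all 0.0
steps = np.arange(0, 1, 0.25)     # start, stop (excluded), step size
grid = np.linspace(0, 1, 5)       # 5 evenly spaced points, stop included

print(zeros.shape)
print(steps)
print(grid)
```

`np.arange` takes a step size, while `np.linspace` takes the number of points, which is often easier to reason about with non-integer steps.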

In [None]:
l = [1, 2, 3, 5]
arr = np.array(l)
arr

In [None]:
l = [1., 2, None]

np.array(l)

In [None]:
arr.dtype

In [None]:
arr = np.ones((3, 3))
arr

In [None]:
np.arange(3, 3.6, .02)

In [None]:
np.array([np.datetime64('2015-01-11 12:00')])

In [None]:
np.arange(1, 4, .4).reshape(4, 2)

In [None]:
## slicing a one-dimensional array works the same as for Python lists

np.arange(1, 40)[1:11:2]

In [None]:
## shape

m = np.array([
    [3, 4, 5],
    [2, 1, 6]
])

m

In [None]:
m.shape

![](./figures/anatomyarray.png)

## Reshaping

* Use `arr.reshape(rows, columns)` to change the shape; the total number of elements must stay the same
* Transposing is done with `arr.T`
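
A short sketch of both operations on the two-by-three matrix defined earlier. Passing `-1` as a dimension is an addition not mentioned above; it asks NumPy to infer that dimension from the number of elements.

```python
import numpy as np

m = np.array([[3, 4, 5],
              [2, 1, 6]])          # shape (2, 3)

column = m.reshape(6, 1)           # the same 6 values as one column
flat = m.reshape(-1)               # -1 lets NumPy infer the length
transposed = m.T                   # shape (3, 2): rows become columns

print(column.shape, flat.shape, transposed.shape)
```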



In [None]:
m #.reshape((6, 1)) #.T

In [None]:
x #.reshape(3, 4)  # x still holds the 12 values of np.arange(-2, 10)

## Matrix multiplication

In [None]:
x = np.array([1, 1, 1])
x

In [None]:
m

In [None]:
m.dot(x)
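
The `@` operator gives the same result as `.dot`. A minimal sketch that repeats the definitions of `m` and `x` so that it runs on its own:

```python
import numpy as np

m = np.array([[3, 4, 5],
              [2, 1, 6]])          # shape (2, 3)
x = np.array([1, 1, 1])            # shape (3,)

product = m.dot(x)     # each row of m is multiplied with x and summed
same = m @ x           # the @ operator performs the same multiplication

print(product, same)   # with x all ones, these are the row sums 12 and 9
```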

## `np.array` statistics

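* Available both as functions (`np.mean(x)`) and as methods on the array (`x.mean()`)
* Most have the same names in `pandas`; note that `var` and `std` in `pandas` default to the sample version (`ddof=1`), while NumPy divides by `n`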
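
On a matrix, the `axis` argument selects the direction that is collapsed. A minimal sketch; the example matrix and the variable names are chosen for illustration.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

total = m.sum()                  # all elements: 15
per_column = m.sum(axis=0)       # collapse the rows: 3, 5, 7
per_row = m.mean(axis=1)         # collapse the columns: 1.0, 4.0

x = np.array([1, 2, 3, 4, 5])
population_var = np.var(x)       # divides by n: 2.0
sample_var = np.var(x, ddof=1)   # divides by n - 1: 2.5

print(total, per_column, per_row, population_var, sample_var)
```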

In [None]:
x = np.array([1, 2, 3, 4, 5])

np.mean(x)
np.sum(x)
np.var(x)

In [None]:
x.sum()
x.mean()
x.min()

## `np.array` functions

* Functions typically apply elementwise, that is, to each element of the array individually

In [None]:
1 / np.arange(1, 5)

In [None]:
np.sin(x)
x ** 2
np.sqrt(x)

## `np.array` adding, multiplication

* Arithmetic with a single number, or with another array of the same shape, also works elementwise

In [None]:
x = np.arange(5)
y = np.arange(5) * 3
x * 5

In [None]:
(x - y) ** 2

## `np.array` "Broadcasting"

In [None]:
m = np.arange(9).reshape(3, 3)
x = np.arange(3) * 2 + 1
m, x

In [None]:
m / x
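
Broadcasting is what makes `m / x` work although `m` has shape `(3, 3)` and `x` has shape `(3,)`: NumPy aligns the shapes from the right and stretches dimensions of length 1. A minimal sketch of dividing along columns versus along rows; the names `by_column` and `by_row` are chosen for illustration.

```python
import numpy as np

m = np.arange(9).reshape(3, 3)     # shape (3, 3)
x = np.arange(3) * 2 + 1           # values 1, 3, 5 with shape (3,)

# x is matched against the last axis: column j is divided by x[j].
by_column = m / x

# Reshaping x to (3, 1) matches it against the rows instead:
# row i is divided by x[i].
by_row = m / x.reshape(3, 1)

print(by_column)
print(by_row)
```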

In [None]:
## plotting example

x = np.arange(0, 2, .02)
y = np.exp(x)
z = np.exp(x * 2)

plt.plot(x, y)
plt.plot(x, z)

## Exercises

Use `plt.plot`, `np.sin` and `np.sum`

* Calculate the discounted value of \$200 paid annually for 50 years, starting right now. Assume an interest rate of 5%.

$$\sum_{i=0}^{49}\frac{200}{(1.05)^i}$$

* Create a matrix with all numbers up to 24, in a three-by-eight shape. Calculate the mean over the columns and the mean over the rows. _Hint:_ You can divide `np.sum(m, axis=1)` by the number of columns, or use `np.mean()` with the `axis` argument.
* Plot $y = \sin(x)$ and $y = \sin(2x + 2)$.
* Plot $x^2$, $x^3$, $x^4$, $x^5$ and $x^6$ in the same figure.

In [None]:
## Plot y=sin(x) and y=sin(2x+2)

In [None]:
## Plot x^2, x^3, x^4, x^5 and x^6 in the same figure

In [None]:
## calculate the discounted value 

In [None]:
## Create a matrix with all numbers up to 24, in a three-by-eight shape. 
## Calculate the mean over the columns and the mean over the rows. 

### `np.random`

In [None]:
np.random.normal(0, 1)


In [None]:
np.random.normal(0, 1, 20)
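
The draws above differ on every run. When reproducible numbers are needed, a seeded generator can be used. This sketch uses `np.random.default_rng`, which is not used elsewhere in this section; the seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator returns the same "random" numbers on every run.
rng = np.random.default_rng(seed=42)
first = rng.normal(0, 1, 5)        # mean 0, standard deviation 1, 5 draws

rng = np.random.default_rng(seed=42)
second = rng.normal(0, 1, 5)       # identical to `first`

print(np.array_equal(first, second))
```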

In [None]:
x = np.random.normal(0, 1, 200)
y = 4 * x  + np.random.normal(0, 1, 200)

c = x + np.random.normal(0, 1, 200)

plt.scatter(x, y, c=c, cmap=plt.cm.Accent_r)

In [None]:
import statsmodels.api as sm

model = sm.OLS(y, x)
results = model.fit()
results.summary()
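
Because no constant column is added to `x`, `sm.OLS(y, x)` fits a line through the origin. As a cross-check, the slope of that model can be computed directly with NumPy. This is a sketch under the assumption that the data are generated as in the scatter example; the seeded generator is added here only to make the result reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily
x = rng.normal(0, 1, 200)
y = 4 * x + rng.normal(0, 1, 200)

# Least-squares slope for y = alpha * x (no intercept):
# alpha = sum(x * y) / sum(x * x)
alpha = (x * y).sum() / (x * x).sum()

# np.linalg.lstsq solves the same problem for a one-column matrix.
alpha_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

print(alpha, alpha_lstsq)                 # both should be close to 4
```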

In [None]:
import pymc3 as pm

with pm.Model():
    
    sigma = pm.HalfNormal('sigma', 10)
    alpha = pm.Normal('alpha', 0, 10)
    obs = pm.Normal('obs', mu=alpha*x, sd=sigma, observed=y)

    trace = pm.sample(1000)
    
pm.traceplot(trace[500:]);

In [None]:
## title, grid and size

x = np.arange(-1, 1, .01)
y = x ** 2

plt.figure(figsize=(4, 2))
plt.plot(x, y)
plt.grid(True)
plt.title("x squared")

In [None]:
## creating labels and placing a legend

x = np.arange(-1, 1, .01)

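## `power` is used as the loop variable so that it does not shadow
## the `exp` function that `%pylab` imports
for power in range(2, 7):
    plt.plot(x, x ** power, label="$x^{}$".format(power))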

plt.grid()
plt.legend(loc='lower right')
plt.savefig("./test-plot.png")

In [None]:
## histogram and axvline (there is also axhline)

n = 500
samples = np.random.exponential(5, n)

plt.figure(figsize=(8, 3))
plt.hist(samples, bins=np.arange(0, 40, 1), color='lightgreen');
plt.axvline(samples.mean(), ls='--', c='r')
plt.title("{n} samples from the exponential distribution".format(n=n))
plt.xlabel("x")
plt.ylabel("y")
plt.xticks(rotation=60)
plt.grid()

In [None]:
## multiple plots in one figure using subplots

x = np.arange(-2, 2, .1)

fig, rows = plt.subplots(2, 2, figsize=(8, 4), sharex=True, sharey=True)

for r_index, row in enumerate(rows):
    for c_index, ax in enumerate(row):
        
        power = ((r_index * 2) + (c_index + 1))
        
        ax.plot(x, x ** power)
        ax.grid(True)
        ax.set_title('Y=X^{}'.format(power))

plt.tight_layout()

## Exercises

Use `plt.scatter`

* Create a scatterplot of two negatively correlated variables
* Color the points according to their value on the horizontal axis, using `plt.cm.Blues` as the colormap

Use `plt.hist`

* Plot a histogram of samples from the normal distribution as a density, give it an appropriate title and save the file to your desktop
* Same for a distribution that is the sum of two normal distributions with means 2 and 5, and standard deviations of 1 and 4

Use `plt.subplots`

* Create 4 subplots with histograms of 200 samples of the normal distribution with mean 0 and standard deviation equal to 1, 2, 3 and 4. 