## Exploratory work with IPython Notebook and Pandas

### IPython Notebooks

* Web-based interactive computing environment
* Provides a browser-based REPL (read-eval-print loop)
* Notebook connects to an IPython server (backend is pluggable - see Project Jupyter)
* Export functionality (eg these slides)
* NOT for engineering

* IPython Notebook docs: http://ipython.org/ipython-doc/3/notebook/notebook.html
* Project Jupyter: https://jupyter.org/
* In-depth tutorial: http://ipython.org/notebook.html#scipy-2013

####  Notebook good practice

* Notebooks should be re-runnable
* Notebooks should be kept under version control
* Self-documenting (read like a report)
* Make code modular (define functions outside, import as needed)

* Setting up a server: http://ipython.org/ipython-doc/3/notebook/public_server.html
* Security (executing code in a browser): http://ipython.org/ipython-doc/3/notebook/security.html
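
As a minimal sketch of the "re-runnable" and "modular" points above: the helper below is hypothetical (the name ```make_sample``` and the default seed are assumptions, not part of any library). It shows one way to make a cell that uses random data return the same values on every run. In practice such a function could live in a module under version control and be imported into the notebook.

```python
import numpy as np


def make_sample(n, seed=0):
    """Return n pseudo-random floats in [0, 1), reproducible for a given seed."""
    rng = np.random.RandomState(seed)  # local generator, no global state touched
    return rng.rand(n)


first = make_sample(5)
second = make_sample(5)

# Re-running the "cell" gives identical data, so downstream output is stable.
print(np.array_equal(first, second))  # True
```

Passing the seed explicitly keeps the choice visible in the notebook rather than hidden in global state.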

#### Commit diffs

If possible clear output before saving and committing:

https://github.com/pelucid/devtools/commit/07cd08a3666101f9302735b091c851a2ee930455

Otherwise you tend to end up with useless diffs:

https://github.com/pelucid/meg/commit/3597b4f30ce73d7ad9595e2ee847a62c6424fce9?diff=split#diff-5439da8795e32767a1974e756961cd4d

* Pre-commit hooks (http://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) can run a Python script that clears the output automatically
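
A rough sketch of such a script, using only the standard library. It assumes the notebook file uses the nbformat 4 JSON layout (a top-level ```cells``` list); older notebook formats are laid out differently and are not handled here. The function names ```clear_outputs``` and ```clear_file``` are illustrative, not an existing tool.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed.

    Assumes the nbformat 4 layout: a top-level "cells" list.
    """
    cleaned = json.loads(json.dumps(notebook))  # deep copy of plain JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def clear_file(path):
    """Rewrite the notebook at `path` in place, without outputs."""
    with open(path) as f:
        notebook = json.load(f)
    with open(path, "w") as f:
        json.dump(clear_outputs(notebook), f, indent=1)
        f.write("\n")


# Small in-memory example instead of a real .ipynb file.
example = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [{"output_type": "execute_result"}],
        },
    ],
}
cleaned = clear_outputs(example)
print(cleaned["cells"][1]["outputs"])  # []
```

A pre-commit hook could call ```clear_file``` on each staged notebook. How the hook finds those files is left out of this sketch.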

#### Easy wins

* Virtual env
* User-agnostic db access (eg .my.cnf)
* Public data (/var/data/client/... not /home/alex/stuff/faff/data/big.csv)
* Try to clear output before committing 
* Each user runs their own server (resource hungry)

#### Magics

* Mini command language inside IPython

https://ipython.org/ipython-doc/dev/interactive/magics.html

* Some common magics within IPython Notebooks are:
  * %lsmagic - list currently available magics
  * %matplotlib inline - inline backend
  * %env - manage environment variables



#### TODOs
- Latex/raw
- nbviewer
- Include an example of a notebook - where is this most useful?

### NumPy

* 'Numerical Python'
* Key datatype is the ```ndarray```
* Fast vectorized operations and broadcasting
* Linear algebra
* Integrate with C, C++, Fortran
* And lots more


In [1]:
import numpy as np

data = [1, 2, 3, 4]
array = np.array(data)
array.shape

(4,)

In [2]:
np.max(array)

4

In [9]:
np.ones((10, 10))

array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])

In [10]:
array

array([1, 2, 3, 4])

In [11]:
array[1:]

array([2, 3, 4])

In [12]:
np.sqrt(array)

array([ 1.        ,  1.41421356,  1.73205081,  2.        ])

In [13]:
np.mean(array)

2.5
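
The cells above apply one operation to every element. Broadcasting extends this to arrays of different shapes. The following is a small sketch with made-up values:

```python
import numpy as np

array = np.array([1, 2, 3, 4])

# Scalar broadcast: the scalar is combined with every element.
doubled = array * 2
print(doubled)        # [2 4 6 8]

# A (3, 1) column combined with a (4,) row broadcasts to a (3, 4) grid.
column = np.array([[0], [10], [20]])
grid = column + array
print(grid.shape)     # (3, 4)
print(grid[2])        # [21 22 23 24]
```

No Python-level loop is written. NumPy stretches the smaller operand along the missing dimension.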

Also:
* Linear algebra (see ```np.linalg``` package)
* Set logic
* Sorting
* Input/output

### pandas

* Built on top of NumPy (so plays well with NumPy-based libraries eg scikit-learn)
* Two key data structures are ```Series``` and ```DataFrame```
* ```Series``` is a one dimensional array with an ```index```
* ```DataFrame``` is a tabular, spreadsheet-like structure with an ordered collection of columns (can be thought of as a ```dict``` of ```Series```)
* Functionality for selecting, filtering, broadcast operations, time series, plotting and lots more
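
To make the ```Series```/```DataFrame``` relationship concrete, here is a small sketch with an explicit label index. The labels and values are arbitrary.

```python
import pandas as pd

# A Series is a one-dimensional array plus an index of labels.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s["y"])                      # 20
print(s[s > 15].index.tolist())    # ['y', 'z']

# A DataFrame built from a dict of Series: each key becomes a column,
# and every column shares the same index.
df = pd.DataFrame({"a": s, "b": s * 2})
print(type(df["a"]).__name__)      # Series
print(df.loc["z", "b"])            # 60
```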


In [24]:
from pandas import Series, DataFrame
import pandas as pd

s = Series(np.random.rand(10))
s

0    0.230771
1    0.121939
2    0.402066
3    0.696302
4    0.601025
5    0.991137
6    0.354698
7    0.875854
8    0.134958
9    0.145949
dtype: float64

In [16]:
s.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [22]:
s[s > 0.5]

0    0.605849
1    0.863140
2    0.634565
6    0.833650
7    0.880774
8    0.863164
dtype: float64

In [26]:
df = DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})
df

|   | a | b |
|---|---|---|
| 0 | 0.558577 | 0.075653 |
| 1 | 0.749256 | 0.341849 |
| 2 | 0.759961 | 0.088551 |
| 3 | 0.484817 | 0.711775 |
| 4 | 0.572196 | 0.890946 |


Many constructors:
* ndarray
* dict of arrays
* dict of Series
* dict of dicts
* List of dicts 
* List of lists or tuples
* Another DataFrame
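
A sketch of several of these constructors side by side. Each call builds the same small two-column table. The column names and values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

from_ndarray = pd.DataFrame(np.array([[1, 3], [2, 4]]), columns=["a", "b"])
from_dict_of_arrays = pd.DataFrame({"a": np.array([1, 2]), "b": np.array([3, 4])})
from_dict_of_series = pd.DataFrame({"a": pd.Series([1, 2]), "b": pd.Series([3, 4])})
from_dict_of_dicts = pd.DataFrame({"a": {0: 1, 1: 2}, "b": {0: 3, 1: 4}})
from_list_of_dicts = pd.DataFrame([{"a": 1, "b": 3}, {"a": 2, "b": 4}])
from_list_of_tuples = pd.DataFrame([(1, 3), (2, 4)], columns=["a", "b"])
from_dataframe = pd.DataFrame(from_list_of_dicts)

frames = [
    from_ndarray, from_dict_of_arrays, from_dict_of_series,
    from_dict_of_dicts, from_list_of_dicts, from_list_of_tuples,
    from_dataframe,
]

# Every constructor yields the same values and column labels.
print(all((f.values == from_ndarray.values).all() for f in frames))  # True
```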

In [27]:
df.values

array([[ 0.55857676,  0.07565304],
       [ 0.74925619,  0.34184864],
       [ 0.75996097,  0.08855115],
       [ 0.48481745,  0.71177506],
       [ 0.57219592,  0.89094611]])

Look at:
* Reductions and broadcast operations ```df.mean()```, ```df - df.mean()```
* Working with missing data ```df.fillna()```
* Comparisons ```df.gt()```
* Descriptive statistics ```df.describe(), df.nunique(), df.hist()```
* Row/column/element-wise function application ```df.apply(lambda x: x**2)```
* Iteration over rows/columns ```df.iterrows()```
* Vectorized string methods ```s.str.lower()``` (available on a ```Series```, not directly on a ```DataFrame```)
* SQL-like merging ```pd.merge()```
* Grouping ```df.groupby('a').groups```
* Re-indexing and re-aligning
* Reshaping ```pd.pivot_table()```
* Plotting ```df.hist()```
* IO from csv, text, HDF5
* Timeseries
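
A compact sketch of a few items from the list above, run on a small made-up table. The column names, values and the ```teams``` lookup table are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "BOB", "alice", "Carol"],
    "team": ["x", "y", "x", "y"],
    "score": [1.0, np.nan, 3.0, 4.0],
})

# Missing data: replace NaN in "score" with a chosen default (0.0 here).
filled = df.fillna({"score": 0.0})

# Comparison and element-wise function application on one column.
above_two = filled["score"].gt(2)
squared = filled["score"].apply(lambda x: x ** 2)

# Vectorized string methods are reached through the Series .str accessor.
lower_names = df["name"].str.lower()

# Grouping: mean score per team (NaN values are skipped by default).
team_means = df.groupby("team")["score"].mean()

# SQL-like left join against a small lookup table.
teams = pd.DataFrame({"team": ["x", "y"], "city": ["Leeds", "York"]})
merged = pd.merge(df, teams, on="team", how="left")

# Reshaping: one row per team, one column per lower-cased name.
pivot = pd.pivot_table(
    filled.assign(name=lower_names),
    index="team", columns="name", values="score", aggfunc="sum",
)

print(team_means.to_dict())    # {'x': 2.0, 'y': 4.0}
print(lower_names.nunique())   # 3
print(merged["city"].tolist()) # ['Leeds', 'York', 'Leeds', 'York']
```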

http://www.xmind.net/m/WvfC/