## Numpy

In [131]:
import numpy as np

* `np` slicing and item assignment work as you would expect from Python lists, except that `np` slices return *views*, not *copies*. To get an independent copy, call `.copy()`.
* You can apply boolean tests to `array` objects, for example `data < .7`. The comparison runs elementwise on every entry.
* The `True`/`False` array that results from such an operation can be used to mask another array.
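
The view-versus-copy distinction is easy to miss, so here is a minimal sketch of it. The array and variable names (`base`, `view`, `independent`) are made up for illustration.

```python
import numpy as np

base = np.arange(6)              # [0, 1, 2, 3, 4, 5]

view = base[:3]                  # slicing returns a view onto the same memory
view[0] = 99                     # so this write is visible through `base`

independent = base[:3].copy()    # .copy() allocates new memory
independent[1] = -1              # so `base` is left untouched

print(base.tolist())             # [99, 1, 2, 3, 4, 5]
print(independent.tolist())      # [99, -1, 2]
print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False
```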

In [132]:
data = np.random.randn(7,4)
data

array([[ 0.43475139,  0.18389037,  0.62009378,  0.08195118],
       [-0.20651149, -0.42358327,  0.99540519,  0.42433063],
       [-0.95819202, -1.03014313, -0.09690439, -1.43257016],
       [ 0.20939109,  1.39082271,  0.62960829,  0.47035797],
       [-1.84189982, -0.70490387,  0.38669305,  0.92276789],
       [-1.28707195,  0.45958711,  1.43616977, -1.13182685],
       [-0.18321675,  2.46871775, -0.2184558 , -1.40806059]])

In [133]:
data[:1]

array([[ 0.43475139,  0.18389037,  0.62009378,  0.08195118]])

In [134]:
data < .7

array([[ True,  True,  True,  True],
       [ True,  True, False,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True, False],
       [ True,  True, False,  True],
       [ True, False,  True,  True]], dtype=bool)

In [135]:
data[data < .7]

array([ 0.43475139,  0.18389037,  0.62009378,  0.08195118, -0.20651149,
       -0.42358327,  0.42433063, -0.95819202, -1.03014313, -0.09690439,
       -1.43257016,  0.20939109,  0.62960829,  0.47035797, -1.84189982,
       -0.70490387,  0.38669305, -1.28707195,  0.45958711, -1.13182685,
       -0.18321675, -0.2184558 , -1.40806059])

In [136]:
data[data < .7].shape

(23,)

* You can index an array with a one-dimensional boolean mask whose length matches the first axis. This keeps the rows where the mask is `True` and drops the rest.

In [137]:
names = np.array(['Al', 'Bo', 'Bo', 'Bo', 'Al', 'Al', 'Bo'])
names

array(['Al', 'Bo', 'Bo', 'Bo', 'Al', 'Al', 'Bo'], 
      dtype='<U2')

In [138]:
names != 'Al'

array([False,  True,  True,  True, False, False,  True], dtype=bool)

In [139]:
data

array([[ 0.43475139,  0.18389037,  0.62009378,  0.08195118],
       [-0.20651149, -0.42358327,  0.99540519,  0.42433063],
       [-0.95819202, -1.03014313, -0.09690439, -1.43257016],
       [ 0.20939109,  1.39082271,  0.62960829,  0.47035797],
       [-1.84189982, -0.70490387,  0.38669305,  0.92276789],
       [-1.28707195,  0.45958711,  1.43616977, -1.13182685],
       [-0.18321675,  2.46871775, -0.2184558 , -1.40806059]])

In [140]:
data[names != 'Al']

array([[-0.20651149, -0.42358327,  0.99540519,  0.42433063],
       [-0.95819202, -1.03014313, -0.09690439, -1.43257016],
       [ 0.20939109,  1.39082271,  0.62960829,  0.47035797],
       [-0.18321675,  2.46871775, -0.2184558 , -1.40806059]])

* Elementwise logical operations on boolean arrays use `|` (or) and `&` (and).

In [141]:
np.array([False, False]) | np.array([True, True])

array([ True,  True], dtype=bool)

In [142]:
np.array([False, True]) & np.array([True, False])

array([False, False], dtype=bool)
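
In practice these operators are mostly used to combine comparisons into a single mask. The sketch below uses a made-up `values` array. Note the parentheses: `&` and `|` bind more tightly than `<` and `>`, and the plain Python keywords `and`/`or` raise a `ValueError` on arrays with more than one element.

```python
import numpy as np

values = np.array([-1.5, 0.2, 0.9, 2.4])

# Parenthesize each comparison before combining.
in_range = (values > 0) & (values < 1)
outside = (values <= 0) | (values >= 1)
negated = ~in_range              # ~ is the elementwise "not"

print(in_range.tolist())         # [False, True, True, False]
print(values[in_range].tolist()) # [0.2, 0.9]
print(outside.tolist() == negated.tolist())  # True
```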

* You can assign data based on a mask.

In [143]:
data[data < .7] = 0
data

array([[ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.99540519,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  1.39082271,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.92276789],
       [ 0.        ,  0.        ,  1.43616977,  0.        ],
       [ 0.        ,  2.46871775,  0.        ,  0.        ]])

* You can select rows in a particular order by passing a list of indices. This is called *fancy indexing*.
* Note that fancy indexing, unlike slicing, returns copies, not views.

In [144]:
data[[1,3,5]]

array([[ 0.        ,  0.        ,  0.99540519,  0.        ],
       [ 0.        ,  1.39082271,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  1.43616977,  0.        ]])
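
To make the copy-versus-view difference concrete, here is a small sketch on a hypothetical `grid` array: writing through a fancy-indexed result leaves the original alone, while writing through a slice changes it.

```python
import numpy as np

grid = np.zeros((4, 3))

picked = grid[[0, 2]]    # fancy indexing: rows 0 and 2, copied into a new array
picked[:] = 1            # modifies only the copy

sliced = grid[0:2]       # basic slicing: a view
sliced[:] = 5            # modifies `grid` itself

print(grid[:, 0].tolist())   # [5.0, 5.0, 0.0, 0.0]
print(picked[:, 0].tolist()) # [1.0, 1.0]
```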

* `where` is a useful multidimensional conditional constructor. The format is `np.where(mask, true_repr, false_repr)`.

In [145]:
arr = np.random.randn(4,4)
arr

array([[-0.24457999,  0.22749908, -0.22941093,  0.5969774 ],
       [ 1.53239961, -0.2178087 ,  0.38933615,  0.21919714],
       [ 0.70018244,  0.88972195,  1.22775003, -0.22000172],
       [ 0.12622223, -1.05183344, -0.77470244,  0.60850113]])

In [146]:
np.where(arr > 0, 2, -2)

array([[-2,  2, -2,  2],
       [ 2, -2,  2,  2],
       [ 2,  2,  2, -2],
       [ 2, -2, -2,  2]])

In [147]:
np.where(arr > 0, 2, arr)

array([[-0.24457999,  2.        , -0.22941093,  2.        ],
       [ 2.        , -0.2178087 ,  2.        ,  2.        ],
       [ 2.        ,  2.        ,  2.        , -0.22000172],
       [ 2.        , -1.05183344, -0.77470244,  2.        ]])

* `any()` and `all()` provide quick multidimensional conditional tests.

In [148]:
arr.any()

True

In [149]:
arr.all()

True

In [150]:
arr[1, 1] = 0
arr.all()

False
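
Called on a numeric array, `any()` and `all()` treat nonzero entries as `True`. They are more often applied to a boolean mask, and both accept an `axis` argument to test per row or per column. A short sketch with a made-up `scores` array:

```python
import numpy as np

scores = np.array([[1, 0, 3],
                   [4, 5, 6]])

whole = (scores > 0).all()               # one entry is not positive
per_row = (scores > 0).all(axis=1)       # test each row separately
per_column = (scores > 5).any(axis=0)    # test each column separately

print(bool(whole))           # False
print(per_row.tolist())      # [False, True]
print(per_column.tolist())   # [False, False, True]
```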

## Pandas

In [151]:
from pandas import Series, DataFrame
import pandas as pd

* The `Series` object is one of the two workhorse structures in `pd`. It is a one-dimensional, array-like object that pairs each value with an index label.
* Values can be retrieved with `values`, the index with `index`.
* The index does not have to be numeric. With a custom index you can look entries up by label, much as you would in a `dict`.
* You can pass a `dict` to define a series.
* Missing entries are represented with `NaN`. `isnull()` can be used to check.
* Arithmetic operations align entries on their index labels before combining them. Labels present in only one operand produce `NaN`, and `NaN` entries propagate.

In [152]:
obj = Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [153]:
obj.values

array([ 4,  7, -5,  3])

In [154]:
obj.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [155]:
obj2 = Series([4,7,-5,3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [156]:
obj2['a']

4

In [157]:
obj3 = Series({'a': 4, 'b': 7, 'c':-5})
obj3

a    4
b    7
c   -5
dtype: int64

In [158]:
obj2 + obj3

a     8
b    14
c   -10
d   NaN
dtype: float64

In [159]:
(obj2 + obj3).isnull()

a    False
b    False
c    False
d     True
dtype: bool

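* The `DataFrame` object is the other primary-use object in pandas. It's a table.
* You can pull a `Series` out of it using `dict` notation.
* You can pull a row out using `.ix[index_num]`. Note that `.ix` was removed in pandas 1.0; in current versions use `.loc` for labels or `.iloc` for positions.
* You can assign a `Series` to a `DataFrame` column. It is aligned on the index, and any rows it does not cover become `NaN`. Lists and arrays, on the other hand, must match the frame's length exactly.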

In [160]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [161]:
frame['debt'] = float('NaN')
frame

   pop   state  year  debt
0  1.5    Ohio  2000   NaN
1  1.7    Ohio  2001   NaN
2  3.6    Ohio  2002   NaN
3  2.4  Nevada  2001   NaN
4  2.9  Nevada  2002   NaN


In [162]:
frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [163]:
frame.ix[0]

pop       1.5
state    Ohio
year     2000
debt      NaN
Name: 0, dtype: object
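
The cells above only assign a scalar to the new column. As a sketch of the `Series`-versus-list behavior, assuming a small made-up frame called `table`:

```python
import pandas as pd

table = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]})

# A Series is aligned on the index: row 1 has no match and becomes NaN.
table['debt'] = pd.Series([-1.2, -1.7], index=[0, 2])

# A list has no index, so its length must match the frame's exactly.
table['year'] = [2000, 2001, 2001]

print(table['debt'].isnull().tolist())   # [False, True, False]
```

Assigning a list of the wrong length, such as `table['year'] = [2000, 2001]`, raises a `ValueError` instead of padding with `NaN`.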

* You can `.T` a frame to transpose it.
* The frame's `index` is actually an `Index` instance, and is immutable so that it can be shared amongst multiple frames.

In [164]:
frame.T

          0     1     2       3       4
pop     1.5   1.7   3.6     2.4     2.9
state  Ohio  Ohio  Ohio  Nevada  Nevada
year   2000  2001  2002    2001    2002
debt    NaN   NaN   NaN     NaN     NaN


In [165]:
frame.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')
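
A brief sketch of what immutability means in practice, using hypothetical names (`labels`, `first`, `second`): item assignment on an `Index` fails, but you can still replace an object's index wholesale without affecting anything else built from the same labels.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])
first = pd.Series([1, 2, 3], index=labels)
second = pd.Series([10, 20, 30], index=labels)   # built from the same labels

mutation_failed = False
try:
    labels[0] = 'z'              # item assignment is not supported
except TypeError as err:
    mutation_failed = True
    print('TypeError:', err)

# To change labels, assign a whole new index instead.
first.index = ['x', 'y', 'z']

print(first.index.tolist())    # ['x', 'y', 'z']
print(second.index.tolist())   # ['a', 'b', 'c']
```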

* `reindex` conforms a series to a new index: existing entries are reordered to match, and labels that were not present before get a missing value.
* You can specify an alternative fill value with `fill_value` (default is `NaN`).
* You can also front-fill or back-fill with `method='ffill'` and `method='bfill'`.
* You can pass a `columns` parameter to re-index columns.

In [166]:
obj = Series([4,7,-5,3], index=['d', 'b', 'a', 'c'])
obj

d    4
b    7
a   -5
c    3
dtype: int64

In [167]:
obj.reindex(['a', 'b', 'c', 'd', 'e'])

a    -5
b     7
c     3
d     4
e   NaN
dtype: float64

In [168]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5
b    7
c    3
d    4
e    0
dtype: int64
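
The fill methods are not exercised in the cells above, so here is a minimal sketch with a made-up `colors` series. Both methods require the existing index to be sorted (monotonic).

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill carries the last seen value forward into the new labels.
forward = colors.reindex(range(6), method='ffill')

# bfill pulls the next value backward; label 5 has nothing after it.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
print(backward.isnull().tolist())
# [False, False, False, False, False, True]
```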

* `reindex` works on `DataFrame` objects as well. By default it reindexes the rows.
* To `reindex` columns, pass the new index using the `columns=` keyword.

In [169]:
frame

   pop   state  year  debt
0  1.5    Ohio  2000   NaN
1  1.7    Ohio  2001   NaN
2  3.6    Ohio  2002   NaN
3  2.4  Nevada  2001   NaN
4  2.9  Nevada  2002   NaN


In [170]:
frame.reindex([1, 2, 3, 4, 0])

   pop   state  year  debt
1  1.7    Ohio  2001   NaN
2  3.6    Ohio  2002   NaN
3  2.4  Nevada  2001   NaN
4  2.9  Nevada  2002   NaN
0  1.5    Ohio  2000   NaN


In [171]:
frame.reindex(columns=['state', 'year', 'pop'])

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


* To drop an entry from a `Series` or `DataFrame` use `drop()`.
* To drop a column from a `DataFrame` add `axis=1` as an argument.
* `drop()` is *not* an in-place operation: it returns a new object and leaves the original unchanged.

In [172]:
frame2 = frame.copy()
frame2 = frame2.drop(0)
frame2

   pop   state  year  debt
1  1.7    Ohio  2001   NaN
2  3.6    Ohio  2002   NaN
3  2.4  Nevada  2001   NaN
4  2.9  Nevada  2002   NaN


In [173]:
frame2 = frame2.drop('year', axis=1)
frame2

   pop   state  debt
1  1.7    Ohio   NaN
2  3.6    Ohio   NaN
3  2.4  Nevada   NaN
4  2.9  Nevada   NaN
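
Because the cells above reassign the result to the same name, the "not in-place" point is easy to overlook. A small sketch, using a made-up frame called `original`, shows that the source object survives the call:

```python
import pandas as pd

original = pd.DataFrame({'pop': [1.5, 1.7], 'year': [2000, 2001]})

trimmed = original.drop('year', axis=1)   # returns a new frame

print(original.columns.tolist())   # ['pop', 'year']
print(trimmed.columns.tolist())    # ['pop']
```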


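* `.ix` selects rows, and optionally columns, by index: `frame.ix[row]` or `frame.ix[row, column]`. It was removed in pandas 1.0; in current versions `.loc` (labels) and `.iloc` (positions) do the same job.
* Indexing, slicing, getting, and setting on a `frame` follow the same rules as in `numpy`, including boolean masks. Prefer the indexer (`.ix` here) when getting or setting individual values.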

In [174]:
frame['pop'] > 2

0    False
1    False
2     True
3     True
4     True
Name: pop, dtype: bool

In [180]:
frame[frame['pop'] > 2] = 2
frame

   pop state  year  debt
0  1.5  Ohio  2000   NaN
1  1.7  Ohio  2001   NaN
2  2.0     2     2   2.0
3  2.0     2     2   2.0
4  2.0     2     2   2.0


In [176]:
frame.ix[0]

pop       1.5
state    Ohio
year     2000
debt      NaN
Name: 0, dtype: object

In [184]:
frame.ix[0, 'pop'] = 1.49
frame

    pop state  year  debt
0  1.49  Ohio  2000   NaN
1   1.7  Ohio  2001   NaN
2   2.0     2     2   2.0
3   2.0     2     2   2.0
4   2.0     2     2   2.0
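
The session above was recorded with an older pandas in which `.ix` still existed; it was removed in pandas 1.0. As a sketch of the same operations with the current indexers, on a made-up frame called `table`: `.loc` selects by label, `.iloc` by position, and a mask combined with a column name limits an assignment to that column instead of overwriting whole rows.

```python
import pandas as pd

table = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]},
                     index=['a', 'b', 'c'])

row_by_label = table.loc['a']        # label-based row lookup
row_by_position = table.iloc[0]      # position-based row lookup

table.loc['a', 'pop'] = 1.49                 # set a single cell by label
table.loc[table['pop'] > 2, 'pop'] = 2.0     # masked assignment, 'pop' column only

print(table['pop'].tolist())     # [1.49, 1.7, 2.0]
print(table['state'].tolist())   # ['Ohio', 'Ohio', 'Nevada']
```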


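* Arithmetic between `pd` objects is elementwise, as it is for `np` arrays, except that `pd` first aligns the operands on their index and column labels rather than on position.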
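
A minimal sketch of label alignment between two frames, using made-up data (`left`, `right`). The result covers the union of row and column labels, and any cell not present in both operands comes out as `NaN`.

```python
import pandas as pd

left = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]}, index=['x', 'y'])
right = pd.DataFrame({'b': [10.0, 20.0], 'c': [30.0, 40.0]}, index=['y', 'z'])

# Only the cell at row 'y', column 'b' exists in both frames.
total = left + right

print(total.shape)                       # (3, 3)
print(total.loc['y', 'b'])               # 14.0
print(int(total.isnull().sum().sum()))   # 8
```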