#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/stable/basics.html>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  
<http://pbpython.com/excel-pandas-comp-2.html>  
<http://pbpython.com/improve-pandas-excel-output.html>  

<http://pbpython.com/pandas-pivot-table-explained.html>  
<http://pbpython.com/pandas-pivot-report.html>

<http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-multi-index>  

<http://www.bigdataexaminer.com/14-best-python-pandas-features/>  
<http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping>  
<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>   

<https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/>  
<http://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/>  

<http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/>  
<http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/>  


In [23]:
import numpy as np
import pandas as pd

import datetime

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>   

|Create pandas data structures| |
|--|--|
|s = Series(data, index) |Create a Series.|
|df = DataFrame(data, index, columns) |Create a DataFrame.|
|p = Panel(data, items, major_axis, minor_axis)|Create a Panel (removed from pandas in version 0.25).|


|	DataFrame Commands	|		|
|--|--|
|	df[col]	|	Select column.	|
|	df.loc[label]	|	Select row by label.	|
|	df.index	|	Return DataFrame index.	|
|	df.drop()	|	Delete given row or column. Pass axis=1 for columns.	|
|	df1 = df1.reindex_like(df2)	|	Reindex df1 with index of df2.	|
|	df.reset_index()	|	Reset index, putting old index in column named index.	|
|	df.reindex()	|	Change DataFrame index; rows for new index labels are filled with NaN.	|
|	df.head(n)	|	Show first n rows.	|
|	df.tail(n)	|	Show last n rows.	|
|	df.sort_index()	|	Sort index.	|
|	df.sort_index(axis=1)	|	Sort columns.	|
|	df.pivot(index,column,values)	|	Pivot DataFrame, using new conditions.	|
|	df.T	|	Transpose DataFrame.	|
|	df.stack()	|	Change lowest level of column labels into innermost row index.	|
|	df.unstack()	|	Change innermost row index into lowest level of column labels.	|
|	df.applymap()	|	Apply function to every element in DataFrame.	|
|	df.apply()	|	Apply function along a given axis	|
|	df.dropna()	|	Drops rows where any data is missing.	|
|	df.count()	|	Returns Series with the number of non-NaN entries in every column.	|
|	df.min()	|	Return minimum of every column.	|
|	df.max()	|	Return maximum of every column.	|
|	df.describe()	|	Generate various summary statistics for every column.	|
|	concat()	|	Merge DataFrame or Series objects	|	

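As a rough illustration of the reshaping commands in the table above, the sketch below pivots a small long-format table and then round-trips it through `stack()` and `unstack()`. The column names `date`, `var` and `val` are made up for this example.

```python
import pandas as pd

# hypothetical long-format data: one row per (date, var) pair
long_df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                        'var': ['x', 'y', 'x', 'y'],
                        'val': [1, 2, 3, 4]})

# pivot: one row per date, one column per var
wide = long_df.pivot(index='date', columns='var', values='val')
print(wide)

# stack: move the column labels into the innermost row index level
stacked = wide.stack()
print(stacked)

# unstack: move the innermost row index level back into the columns
print(stacked.unstack())
```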
|	Groupby	|		|
|--|--|
|	groupby()	|	Split DataFrame by columns. Creates a GroupBy object (gb).	|
|	gb.agg()	|	Apply function (single or list) to a GroupBy object.	|
|	gb.transform()	|	Applies function and returns object with same index as one being grouped.	|
|	gb.filter()	|	Filter GroupBy object by a given function.	|
|	gb.groups	|	Return dict whose keys are the unique groups, and values are axis labels belonging to each group.	|

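The GroupBy commands in the table above can be sketched as follows. This is a minimal example on a made-up four-row DataFrame, not a full treatment of the GroupBy API.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
gb = df.groupby('A')

# agg: one row per group, one column per aggregation function
summary = gb['C'].agg(['sum', 'mean'])
print(summary)

# transform: the result has the same index as the original DataFrame
centered = gb['C'].transform(lambda x: x - x.mean())
print(centered)

# filter: keep only the rows of groups whose sum of C exceeds 4
big = gb.filter(lambda g: g['C'].sum() > 4)
print(big)

# groups: dict mapping each group key to the row labels in that group
print(gb.groups)
```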
|	I/O	|		|
|--|--|
|	df.to_csv('foo.csv')	|	Save to CSV.	|
|	read_csv('foo.csv')	|	Read CSV into DataFrame.	|
|	df.to_excel('foo.xlsx', sheet_name)	|	Save to Excel.	|
|	read_excel('foo.xlsx','sheet1', index_col = None, na_values = ['NA'])	|	Read Excel into DataFrame.	|

	


##Set the display width when printing to console

There are quite a few options to configure here. If you're using IPython, use tab completion to find the [full set](http://pandas.pydata.org/pandas-docs/version/0.15.2/options.html) of display options:

    pd.options.display.<tab>

<http://stackoverflow.com/questions/21249206/how-to-configure-display-output-in-ipython-pandas>

In [24]:
#maximum number of rows and columns displayed when a frame is pretty-printed
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 10)
# Width of the display in characters.
pd.set_option('display.width', 150)
# The maximum width in characters of a column in the repr of a pandas data structure.              
pd.set_option('display.max_colwidth', 150)

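If the display settings should only apply to a few statements, `pd.option_context` may be more convenient than `pd.set_option`, because the previous values are restored when the `with` block ends. A small sketch, where the option values 5 and 80 are arbitrary:

```python
import pandas as pd

before = pd.get_option('display.max_rows')

# options are changed only inside the with block
with pd.option_context('display.max_rows', 5, 'display.width', 80):
    inside = pd.get_option('display.max_rows')

# the earlier setting is back in force here
after = pd.get_option('display.max_rows')
print(before, inside, after)
```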
##Creating/Loading Data

###Functions to create different dataframe types

An empty DataFrame can be created as follows. The `empty` attribute tests whether the DataFrame is empty; in this case it is.

In [25]:
columns = ['A','B', 'C']
df = pd.DataFrame(columns=columns)
print(df)
print(df.empty)

Empty DataFrame
Columns: [A, B, C]
Index: []
True


If you add an index, the row contents for the rows specified by the index will be empty (filled with NaN).  However, testing to see if the DataFrame is empty will show that it is not empty: there are rows.

In [26]:
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=3, freq='D')
columns = ['A','B', 'C']
df = pd.DataFrame(index=index, columns=columns)
print(df)
print(df.empty)
df = df.fillna(0) # with 0s rather than NaNs
print(df)

              A    B    C
2015-06-30  NaN  NaN  NaN
2015-07-01  NaN  NaN  NaN
2015-07-02  NaN  NaN  NaN
False
            A  B  C
2015-06-30  0  0  0
2015-07-01  0  0  0
2015-07-02  0  0  0


###Creating and filling DataFrames

The following functions create pandas DataFrames in a variety of ways.

In [27]:
# DataFrame by passing a numpy array, with a datetime index and labeled columns.
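def makeDateRand(nrows=6, ncols=4):
    # one date per row, so the index length always matches nrows
    dates = pd.date_range('20130101',periods=nrows)
    # column labels are taken from 'ABCD', so ncols must be at most 4
    df = pd.DataFrame(np.random.randn(nrows,ncols),index=dates,columns=list('ABCD')[:ncols])
    return df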

In [28]:
# DataFrame by passing a numpy array, with an int index and labeled columns.
def makeRand(nrows=4, ncols=4):
    return pd.DataFrame(np.random.randn(nrows, ncols), columns=['A','B','C','D'])

In [29]:
#create from dictionary
def makefoobar():
    return  pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                          'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                          'C' : np.random.randn(8),
                          'D' : np.random.randn(8)})

In [30]:
#create from dictionary
def makegridDF():
    return  pd.DataFrame({'A' : [1,2,3],
                          'B' : [4,5,6],
                          'C' : [7,8,9],
                          'D' : [10,11,12]})

In [31]:
#create from dictionary
def makegAlphaDF():
    return  pd.DataFrame({'A' : ['0a','1a','2a'],
                          'B' : ['0b','1b','2b'],
                          'C' : ['0c','1c','2c'],
                          'D' : ['0d','1d','2d']})

In [32]:
#create dataframe from a user-supplied string, using a user-defined regex separator 
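def makeFromString(string, sep=r'\s+', header=0):
    # io.StringIO works on Python 3 (and on Python 2 with unicode strings)
    from io import StringIO
    # header=0 uses the first non-blank line as the column names
    return pd.read_csv(StringIO(string), sep=sep, header=header)
#alternative method
#     import io
#     return pd.read_table(io.StringIO(string), sep=sep, header=header)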

In [33]:
# DataFrame by passing a dict of objects that can be converted to series-like.
# using categorical in column E
def makecatedf():
    df = pd.DataFrame({'A' : 1.,
                       'B' : pd.Timestamp('20130102'),
                       'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                       'D' : np.array([3] * 4,dtype='int32'),
                       'E' : pd.Categorical(["test","train","test","train"]),
                       'F' : 'foo',
                       'G': ['foox','fooa','foon','fooz']})
    return (df)

In [34]:
#create a dataframe with a NaN
def makeNaNdf():
    return pd.DataFrame([[1, np.nan], [3, 4], [4,5]], columns=list('AB'))

In [35]:
#create a DataFrame with hierarchical column index
#From http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
def createMultColIdx():
    return pd.DataFrame([list('abcd'),
                  list('efgh'),
                  list('ijkl'),
                  list('mnop')],
                  columns=pd.MultiIndex.from_product([['one','two'],
                      ['first','second']]))

In [36]:
print(makeDateRand())

                   A         B         C         D
2013-01-01 -0.257747  1.882416 -0.865108 -0.892473
2013-01-02 -0.010006  0.766364  2.787611 -0.068185
2013-01-03 -0.058180 -0.402305 -0.481123 -1.451116
2013-01-04 -0.093505  0.445993 -0.557638  0.371417
2013-01-05  1.041981  0.355350 -1.165290 -0.135739
2013-01-06 -1.529945  0.092371  0.751263 -0.148109


Display the data types

In [37]:
df2 = makecatedf()
print(df2.dtypes)
df2

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
G            object
dtype: object


Unnamed: 0,A,B,C,D,E,F,G
0,1,2013-01-02,1,3,test,foo,foox
1,1,2013-01-02,1,3,train,foo,fooa
2,1,2013-01-02,1,3,test,foo,foon
3,1,2013-01-02,1,3,train,foo,fooz


In [38]:
content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

makeFromString(content)

Unnamed: 0,Time,A_x,A_y,A_z,B_x,B_y,B_z
0,-0.075509,-0.123527,-0.547239,-0.453707,-0.969796,0.248761,1.369613
1,-0.13358,-0.308314,-0.839347,-0.517989,0.65212,0.477232,-0.391767
2,0.623841,0.473552,0.059428,0.726088,-0.593291,-3.186297,-0.846863


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

##Display two tables side-by-side

<https://gist.github.com/stefanv/6416926>  

In [39]:
class side_by_side():
    def __init__(self, *frames):
        self.frames = frames

    def _repr_html_(self):
        width = 100. / len(self.frames)

        s = ""
        for f in self.frames:
            s += "<div style='float: left;'>%s</div>" % f._repr_html_()

        return s

In [40]:
side_by_side(makeDateRand(), makeDateRand())

Unnamed: 0,A,B,C,D
2013-01-01,-1.51689,0.021219,1.20133,2.145407
2013-01-02,0.982826,0.544687,-0.156427,-1.332253
2013-01-03,-1.026995,0.546285,1.093837,-1.519708
2013-01-04,-0.044097,-1.308703,1.843453,1.227321
2013-01-05,0.051633,-1.845223,0.341281,-0.800861
2013-01-06,2.025568,0.514407,1.21106,0.743768

Unnamed: 0,A,B,C,D
2013-01-01,-3.148736,-0.00498,0.458335,0.965817
2013-01-02,-1.450596,0.313772,-0.01036,-0.664239
2013-01-03,0.915434,-1.95357,0.383879,-0.813461
2013-01-04,-0.866156,1.403974,-0.728736,-2.481252
2013-01-05,-1.25537,-0.016967,-0.44605,-0.947208
2013-01-06,-0.305307,0.271728,1.386841,1.119769


##File Input/Output

###CSV and Excel

In [41]:
df = makeDateRand()
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2013-01-01,-0.815654,0.182509,0.338265,-0.346617
1,2013-01-02,1.004594,0.161185,0.2061,-0.433438
2,2013-01-03,-1.09163,-0.122748,-1.350434,1.040738
3,2013-01-04,1.317175,-1.865405,-0.289084,1.609441
4,2013-01-05,0.561438,-0.698553,0.029656,1.648334
5,2013-01-06,-0.020715,-2.488948,1.729207,0.494589


In [51]:
import pandas as pd
# use whitespace as separator
df = makeDateRand()
df.to_csv('foo.csv', sep=' ')
pd.read_csv('foo.csv', sep=r'\s+')

In [43]:
df = makeDateRand()
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Unnamed: 0,A,B,C,D
2013-01-01,-0.375074,-0.05133,-0.561504,0.388171
2013-01-02,0.087058,1.352036,-0.411912,-0.48648
2013-01-03,0.336514,-0.160262,-1.033558,-0.118928
2013-01-04,-0.481207,-0.796891,0.602879,-0.649082
2013-01-05,-0.534907,-2.023231,1.155798,0.255004
2013-01-06,-1.592183,-0.427264,2.041786,-0.95541


###HDF5

Excel and CSV formats can only store a single element per cell. HDF5 provides the means to store hierarchical data, where some DataFrame cells can contain Numpy arrays or other structures. The example below has a Numpy array in each cell of column 'arrays' and a pandas Series of Numpy arrays in column 'dframe'.  

Note that the Series index must match the dataframe index, otherwise the Series elements cannot be assigned (NaN are then assigned to all elements in the column).

<http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#io-hdf5>

In [34]:
df = makeDateRand()
# x//2 is integer division, so the arrays hold integers on both Python 2 and 3
df['arrays'] = [np.asarray([[1,x],[x//2,4*x]]) for x in range(6)]
ser = pd.Series([np.asarray([[1,x],[x//2,4*x]]) for x in range(6)],index=pd.date_range('20130101',periods=6))
print(ser)
df['dframe'] = ser
print(df)
df.to_hdf('df.hdf5',key='df',mode='w',append=False)
df = pd.read_hdf('df.hdf5', 'df')
print(df)
print(df.dtypes)

2013-01-01     [[1, 0], [0, 0]]
2013-01-02     [[1, 1], [0, 4]]
2013-01-03     [[1, 2], [1, 8]]
2013-01-04    [[1, 3], [1, 12]]
2013-01-05    [[1, 4], [2, 16]]
2013-01-06    [[1, 5], [2, 20]]
Freq: D, dtype: object
                   A         B         C         D             arrays             dframe
2013-01-01  1.430077 -0.558571  1.100196  0.274112   [[1, 0], [0, 0]]   [[1, 0], [0, 0]]
2013-01-02  0.560613  0.022041 -1.367565 -1.254090   [[1, 1], [0, 4]]   [[1, 1], [0, 4]]
2013-01-03  1.992973  0.390096  0.564768 -0.685658   [[1, 2], [1, 8]]   [[1, 2], [1, 8]]
2013-01-04  0.019785 -0.491109 -0.124125 -1.651703  [[1, 3], [1, 12]]  [[1, 3], [1, 12]]
2013-01-05 -0.492693 -2.297549  2.269188  0.214106  [[1, 4], [2, 16]]  [[1, 4], [2, 16]]
2013-01-06  0.080028 -2.080811  0.843215  0.385552  [[1, 5], [2, 20]]  [[1, 5], [2, 20]]
                   A         B         C         D             arrays             dframe
2013-01-01  1.430077 -0.558571  1.100196  0.274112   [[1, 0], [0, 0]]   [

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['arrays', 'dframe']]



##Dataframe properties

###Row properties

The row count can be obtained in two different forms:

- `len(df)` and `df.shape[0]` give the number of rows in the DataFrame, irrespective of the contents of the cells.
- The `df.count()` function returns a pandas Series containing the number of valid (non-NaN) entries in each column.

In [35]:
df = makeNaNdf()
print(df)
print('\nlen(df) = {}'.format(len(df)))
print('\nshape[0] = {}'.format(df.shape[0]))
print('\ntype(df.count()) = {}'.format(type(df.count())))
print('\ndf.count() = \n{}'.format(df.count()))
print("\ndf.count()['A'] = {}".format(df.count()['A']))
print('\ndf.count()[1] = {}'.format(df.count()[1]))
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

   A   B
0  1 NaN
1  3   4
2  4   5

len(df) = 3

shape[0] = 3

type(df.count()) = <class 'pandas.core.series.Series'>

df.count() = 
A    3
B    2
dtype: int64

df.count()['A'] = 3

df.count()[1] = 2

Nans = 
       A      B
0  False   True
1  False  False
2  False  False

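The complement of `df.count()` is the number of missing entries per column. A short sketch using the same data as `makeNaNdf()`: `isnull()` gives a boolean DataFrame and `sum()` counts the `True` values column by column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [3, 4], [4, 5]], columns=list('AB'))

valid = df.count()            # non-NaN entries per column
missing = df.isnull().sum()   # NaN entries per column

print(valid)
print(missing)
# valid and missing entries together account for every row
print((valid + missing == len(df)).all())
```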

The DataFrame index (row names) can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html), which can be used to retrieve a list of the row names:

In [36]:
df = makegridDF()
print(df.index)
print(df.index.tolist())

Int64Index([0, 1, 2], dtype='int64')
[0, 1, 2]


If the DataFrame index is a more complex data type, an index of that type is returned:

In [37]:
df = makeDateRand()
print(df.index)
print(df.index.tolist()) # returns a list
print(df.index.values) # returns an array

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
[Timestamp('2013-01-01 00:00:00', offset='D'), Timestamp('2013-01-02 00:00:00', offset='D'), Timestamp('2013-01-03 00:00:00', offset='D'), Timestamp('2013-01-04 00:00:00', offset='D'), Timestamp('2013-01-05 00:00:00', offset='D'), Timestamp('2013-01-06 00:00:00', offset='D')]
['2013-01-01T02:00:00.000000000+0200' '2013-01-02T02:00:00.000000000+0200'
 '2013-01-03T02:00:00.000000000+0200' '2013-01-04T02:00:00.000000000+0200'
 '2013-01-05T02:00:00.000000000+0200' '2013-01-06T02:00:00.000000000+0200']


In [38]:
#set / change the index name
df = makegridDF()
df.index.name = 'MyIndex'
print(df)
print(df.index.tolist()) # returns a list
print(df.index.values)  # returns an array

         A  B  C   D
MyIndex             
0        1  4  7  10
1        2  5  8  11
2        3  6  9  12
[0, 1, 2]
[0 1 2]


Use a column's values to set the index. Note that repeated values are allowed in the index: pandas does not require index labels to be unique, so `df.loc[2,:]` below returns both rows labelled 2.

In [39]:
df = makegridDF()
print(df)

df.loc[2,'A'] = 2
df.index = df.A
df.index.name = 'MyNewIndex'
print(df.loc[2,:])


   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
            A  B  C   D
MyNewIndex             
2           2  5  8  11
2           2  6  9  12

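Because duplicate index labels are accepted silently, it can be worth checking for them explicitly. The sketch below uses a small made-up DataFrame; `is_unique` and `duplicated()` are attributes of the index object.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [4, 5, 6]})
df.index = df['A']

# True only if every index label occurs exactly once
print(df.index.is_unique)

# boolean array marking the second and later occurrences of a label
dup_mask = df.index.duplicated()
print(dup_mask)

# selecting a repeated label returns all matching rows as a DataFrame
rows = df.loc[2]
print(rows)
```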

### Column properties

The column names can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html):

In [40]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

Get a list of the columns in the DataFrame - there are several ways to do this:

In [41]:
df = makeDateRand()
print(list(df.columns.values))
print(df.columns.values.tolist()) #fastest
print(list(df))

['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D']


###DataFrame values

Get the values of the DataFrame contents as a Numpy array:

In [42]:
df = makeDateRand()
df.values

array([[-0.01306872, -0.29244687, -1.46897297, -1.52279137],
       [-0.09730003,  0.5258122 , -1.7660565 ,  0.66580562],
       [-2.31641091,  0.95014814,  0.17203528, -0.40020686],
       [ 0.40035317, -0.7199735 ,  1.12052782,  1.25586226],
       [-0.4157624 , -1.09001308,  0.14626769, -0.50606948],
       [-1.02902624, -0.42427113, -0.0748173 , -0.8837951 ]])

##NaN in Pandas / Numpy

Empty cells, or cells with missing data are filled with NaNs.  The example shows how to test for NaN values (`isnull()`).

In [43]:
a = np.nan
print(a)
print(pd.isnull(a))

nan
True


Use the `fillna()` function to fill NaN cells with some other value.

In [44]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

#change all NaN to some other value
df.B = df.B.fillna('**')
print('\nNans replaced with ** = \n{}'.format(df))


   A   B
0  1 NaN
1  3   4

Nans = 
       A      B
0  False   True
1  False  False

Nans replaced with ** = 
   A   B
0  1  **
1  3   4


Create a new DataFrame by removing the rows that have a NaN in a given column:

In [45]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
df = df[pd.notnull(df['B'])]
print(df)

   A   B
0  1 NaN
1  3   4
   A  B
1  3  4

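The same row removal can also be written with `dropna()`, restricted to the column of interest through its `subset` argument. A minimal sketch comparing the two forms:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))

# boolean mask, as in the example above
by_mask = df[pd.notnull(df['B'])]

# dropna, looking only at column B when deciding which rows to drop
by_dropna = df.dropna(subset=['B'])

print(by_dropna)
print(by_mask.equals(by_dropna))
```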

##Manipulating DataFrames

###View of a DataFrame vs a Copy of a DataFrame

See [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) for the full example.
           
More information on [multi-indexing](http://pandas-docs.github.io/pandas-docs-travis/advanced.html)

In [46]:
dfmi = createMultColIdx()
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


In the code below the first form `dfmi['one']['second']` is called chained indexing: it makes two calls to `__getitem__`, one after the other.  The first call `dfmi['one']` returns a DataFrame, which is the input to the second call `(dfmi['one'])['second']`.  

The second form `dfmi.loc[:,('one','second')]` passes a nested tuple to a single call to `__getitem__`, which can be significantly faster, and allows one to index both axes if so desired. Look at the name of the Series returned in both cases and spot the difference.

In [47]:
print(dfmi['one']['second'])
print(dfmi.loc[:,('one','second')])


0    b
1    f
2    j
3    n
Name: second, dtype: object
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object


The first form gives a `SettingWithCopyWarning` when it is used to set values.  Since chained indexing makes two calls, either call may return a copy of the data, depending on the way it is sliced. Thus when setting, you may actually be setting a copy, and not the original frame data. 

The `.loc` operation is a single python operation, and thus can select a slice (which still may be a copy), but allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would think.

The reason for having the `SettingWithCopy` warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array.

In [48]:
dfmi['one']['second'] = 1 # assignment has no effect on the original!!
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


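To get the desired effect, use `.loc` to directly address the original DataFrame.  In the example below the built-in `slice` selects a range of column labels; as the output shows, this range covers every column under `'one'`, so both `('one','first')` and `('one','second')` are set.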

<http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers>

In [49]:
dfmi.loc[:,slice('one','second')] = 1
print(dfmi)

    one          two       
  first second first second
0     1      1     c      d
1     1      1     g      h
2     1      1     k      l
3     1      1     o      p

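To set only the single column `('one','second')`, which is what the chained assignment above was attempting, the column tuple can be passed to `.loc` directly. This is a sketch on a two-row version of the same frame; the value `'X'` is arbitrary.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame([list('abcd'), list('efgh')], columns=cols)

# a single .loc call addresses the original frame, so the assignment sticks
dfmi.loc[:, ('one', 'second')] = 'X'
print(dfmi)
```

Only the addressed column changes; `('one','first')` and the columns under `'two'` keep their values.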

###Dropping a row from a DataFrame

When using drop(), note the axis direction.   
- to drop a column axis=1  
- to drop a row axis=0  (default)

In [50]:
#drop a column
df = makegridDF()
#first drop the 'A' column
print(df.drop('A',axis=1))

   B  C   D
0  4  7  10
1  5  8  11
2  6  9  12


You can also use row index labels to drop rows.  The following example shows several index-based row drop methods.

In [51]:
df = makegridDF()
print(df)
print(df.drop(1)) # single index
print(df.drop([1,2])) # list of indexes
print(df.drop(0,axis=0)) #single index explicit row selection
print(df.drop(df.index[[0,2]])) #list in the index 
for idx, row in df.iterrows():#iterate over all rows
    print(idx,row)
    df.drop(idx,inplace=True)
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
   A  B  C   D
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
(0, A     1
B     4
C     7
D    10
Name: 0, dtype: int64)
(1, A     2
B     5
C     8
D    11
Name: 1, dtype: int64)
(2, A     3
B     6
C     9
D    12
Name: 2, dtype: int64)
Empty DataFrame
Columns: [A, B, C, D]
Index: []


In [52]:
#drop some of the rows
df = makeDateRand()
df = df.drop(df.index[[0,1,2,3]])
for idx, row in df.iterrows():
    print(idx,row)

(Timestamp('2013-01-05 00:00:00', offset='D'), A    1.053305
B    0.141954
C    0.330076
D   -0.323197
Name: 2013-01-05 00:00:00, dtype: float64)
(Timestamp('2013-01-06 00:00:00', offset='D'), A    1.121166
B   -0.306181
C    1.422362
D   -1.712045
Name: 2013-01-06 00:00:00, dtype: float64)


In [53]:
#drop row based on value in a column
df = makegridDF()
print(df)
print(df[df['A'] >= 3])
print(df[(df['A'] >= 2) & (df['B']<6)])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
2  3  6  9  12
   A  B  C   D
1  2  5  8  11

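The boolean selections above keep the rows that match a condition. To drop the matching rows instead, either invert the mask or pass the matching index labels to `drop()`. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
mask = df['A'] >= 3

# keep everything that does NOT match the condition
kept_by_mask = df[~mask]

# equivalent: look up the labels of the matching rows and drop them
kept_by_drop = df.drop(df[mask].index)

print(kept_by_mask)
print(kept_by_mask.equals(kept_by_drop))
```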

###Concatenation or Appending rows or DataFrames

`df.shape[0]` returns the number of rows already in the DataFrame. Row labels in a default integer index are zero-based, hence if `df.shape[0]` is used as a row label, it will point to a new row immediately beyond the current last row.  This is an easy way to add row(s) to an existing DataFrame that has a default integer index. 

Rows can be added to the DataFrame by [setting with enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement).  The `df.loc[i]` (location) construct points to row `i`, which need not be an existing row.

In [54]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df)
df.loc[df.shape[0]] = ['a','b'] # add a row immediately beyond the current last
df.loc[df.shape[0]] = [np.nan,'new!'] # add a row immediately beyond the current last
print(df)

   A  B
0  1  2
1  3  4
     A     B
0    1     2
1    3     4
2    a     b
3  NaN  new!


Append rows to a DataFrame; see [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation) in the documentation.  This example makes a copy of one of the rows, modifies it, and appends it to the DataFrame.  Note that in this case a `copy()` is required so that the row is modified independently of the original DataFrame before appending.

In [55]:
df = makeRand()
s = df.iloc[2].copy() # copy is required, otherwise a view is taken
s.iloc[2] = 1000 # set the third element by position
print(s)
df.append(s, ignore_index=True)

A      -0.320352
B      -0.572016
C    1000.000000
D      -1.198180
Name: 2, dtype: float64


Unnamed: 0,A,B,C,D
0,-0.307783,-0.229234,1.574565,0.520096
1,0.186255,0.944296,-0.890249,0.335446
2,-0.320352,-0.572016,3.187768,-1.19818
3,0.066428,1.775879,-0.011416,-0.426466
4,-0.320352,-0.572016,1000.0,-1.19818


The `append()` function can be used to add one or more rows, formed as DataFrames, to an existing DataFrame.

In [56]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = df.append(df2) # append row(s)
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8

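`DataFrame.append` was removed in pandas 2.0, so on recent versions the same results are obtained with `pd.concat`. The sketch below repeats the example above with `concat`; the extra row holding 9 and 10 is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# same result as df.append(df2): the original row labels are kept
stacked = pd.concat([df, df2])
print(stacked)

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([df, df2], ignore_index=True)
print(renumbered)

# a single row held in a Series: turn it into a one-row DataFrame first
row = pd.Series({'A': 9, 'B': 10})
with_row = pd.concat([renumbered, row.to_frame().T], ignore_index=True)
print(with_row)
```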

Concatenate some existing rows from the current DataFrame to itself.

In [57]:
df = makeRand()
df2 = pd.concat([df,df[2:4]])
df2

Unnamed: 0,A,B,C,D
0,1.818372,-0.671796,-1.137477,0.241236
1,1.176977,-0.644033,0.311845,1.560931
2,-0.031638,1.499439,0.63831,0.892017
3,-1.076519,-1.250539,1.522066,0.23186
2,-0.031638,1.499439,0.63831,0.892017
3,-1.076519,-1.250539,1.522066,0.23186


The following example concatenates three views of a DataFrame.

In [58]:
# Concatenating pandas objects together
df = makeRand(10,4)
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)


          A         B         C         D
0  0.073759  0.173090 -0.160249  0.494095
1  1.125235 -0.157869  0.676483  0.173637
2 -0.429321  2.064771  0.212448  1.251799
3  0.546685 -0.066516  0.529381  1.319544
4  0.976501  0.473518  1.130852  0.056720
5 -0.808577  0.567734 -0.110561  0.187893
6  0.924681 -1.975411 -0.619851 -0.932975
7  1.168138  0.406968 -0.084265  0.230360
8 -0.321919 -0.210099 -0.407302  0.757779
9 -0.157319  1.590124  0.908581  0.054799


Unnamed: 0,A,B,C,D
0,0.073759,0.17309,-0.160249,0.494095
1,1.125235,-0.157869,0.676483,0.173637
2,-0.429321,2.064771,0.212448,1.251799
3,0.546685,-0.066516,0.529381,1.319544
4,0.976501,0.473518,1.130852,0.05672
5,-0.808577,0.567734,-0.110561,0.187893
6,0.924681,-1.975411,-0.619851,-0.932975
7,1.168138,0.406968,-0.084265,0.23036
8,-0.321919,-0.210099,-0.407302,0.757779
9,-0.157319,1.590124,0.908581,0.054799


Concatenation stacks together rows from two DataFrames. In the example below the `df` DataFrame is concatenated with a 2x2 slice from `dfr`.  There are two observations from the code below:

- The column names are used when concatenating the rows. In the first example the column names are consistent and appended as expected. In the second example the column names do not exactly agree and cells with missing data are filled with NaN.
- The index data type of the concatenated DataFrame must be the same as the main DataFrame (a hash error occurs otherwise).  For example, if the index is a DatetimeIndex, the concatenation will not work.

In [59]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
dfr = makegridDF()
print(dfr)
print('\nExample 1')
print('to be concatenated={}'.format(dfr.loc[1:2,['A','B']]))
df2 = pd.concat([df,dfr.loc[1:2,['A','B']]])
print(df2)

print('\nExample 2')
print('to be concatenated={}'.format(dfr.loc[1:2,['B','C']]))
df2 = pd.concat([df,dfr.loc[1:2,['B','C']]])
print(df2)

#this will not concatenate in the examples above - index of wrong type
# df = makeDateRand()
# df.loc['20130102':'20130104',['A','B']]

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

Example 1
to be concatenated=   A  B
1  2  5
2  3  6
   A  B
0  1  2
1  3  4
1  2  5
2  3  6

Example 2
to be concatenated=   B  C
1  5  8
2  6  9
    A  B   C
0   1  2 NaN
1   3  4 NaN
1 NaN  5   8
2 NaN  6   9


###SQL style merges

See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join).

In [60]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5

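The merge above is an inner join, and because `'foo'` is repeated in both frames every left row is paired with every right row. The `how` argument selects other join types. A sketch with made-up keys that only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# inner (default): only keys present in both frames
inner = pd.merge(left, right, on='key')
print(inner)

# left: every row of left, with NaN where right has no match
left_join = pd.merge(left, right, on='key', how='left')
print(left_join)

# outer: union of the keys from both frames
outer = pd.merge(left, right, on='key', how='outer')
print(outer)
```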

###Handling duplicated rows

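Finding duplicate rows, where the values in all the columns of the `subset` must be duplicates.  By default the first occurrence is left unmarked and later repeats are marked as duplicates; with `take_last=True` the last occurrence is left unmarked instead. (Later pandas releases replace `take_last=True` with `keep='last'`.)  The second example creates a new DataFrame containing only the duplicated rows, and counts them.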

In [61]:
df2 = makegridDF()
df2.loc[1,'A'] = 1
df2.loc[1,'B'] = 1
df2.loc[0,'B'] = 1
df2['isdup'] = df2.duplicated(subset=['A','B'])
print(df2)
df2['isdup'] = df2.duplicated(subset=['A','B'], take_last=True)
print(df2)

# create a new dataframe with the repeated rows
df = df2[df2.duplicated(subset=['A','B'], take_last=True)] 
print(len(df))
print(df)

   A  B  C   D  isdup
0  1  1  7  10  False
1  1  1  8  11   True
2  3  6  9  12  False
   A  B  C   D  isdup
0  1  1  7  10   True
1  1  1  8  11  False
2  3  6  9  12  False
1
   A  B  C   D isdup
0  1  1  7  10  True


The next example only checks for duplicates in column 'A' and then deletes these row(s) from the DataFrame.

In [62]:
df2.drop_duplicates(subset=['A'], take_last=True, inplace=True)
df2

Unnamed: 0,A,B,C,D,isdup
1,1,1,8,11,False
2,3,6,9,12,False

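On pandas versions that no longer accept `take_last`, the `keep` argument plays the same role. The sketch below assumes such a version and uses the same values as the example above; `keep=False` additionally marks every member of a duplicated group.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [1, 1, 6], 'C': [7, 8, 9]})

# default keep='first': the first occurrence is not marked
first = df.duplicated(subset=['A', 'B'])

# keep='last': the last occurrence is not marked
last = df.duplicated(subset=['A', 'B'], keep='last')

# keep=False: all rows that have a duplicate are marked
both = df.duplicated(subset=['A', 'B'], keep=False)

# drop duplicates in column A, keeping the last occurrence
deduped = df.drop_duplicates(subset=['A'], keep='last')
print(deduped)
```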

The index value of any arbitrary row can be changed by making a list of the index, changing the value in the list and then re-assigning the list back to the DataFrame.

In [63]:
df2 = makegridDF()
df2.index = df2.index.tolist()[:-1]   + ['New Idx Value']
print(df2)

               A  B  C   D
0              1  4  7  10
1              2  5  8  11
New Idx Value  3  6  9  12


###Transpose a DataFrame

In [64]:
df = makegridDF()
print(df.T)
print(df.index)
print(df.columns)

    0   1   2
A   1   2   3
B   4   5   6
C   7   8   9
D  10  11  12
Int64Index([0, 1, 2], dtype='int64')
Index([u'A', u'B', u'C', u'D'], dtype='object')


###Selecting a subset of columns from a DataFrame

In [65]:
df = makegridDF()
print(df)
print(df.A)
print(df['A'])
print(df[['A','B']])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Selecting a subset of columns may result in a copy or a view of the original DataFrame.  In this example a copy is made when selecting the columns, but a warning ensues when you try to assign a value to an element.  This warning arises because in some cases a view into the original DataFrame is returned and pandas cannot always know which form is used.

<http://stackoverflow.com/questions/11285613/selecting-columns>  

In [66]:
df = makegridDF()
df1 = df[['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


A safer way to build a new DataFrame from a selection of columns, which also gets rid of the warning, is as follows:

In [67]:
df = makegridDF()
df1 = df.loc[:,['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


Another way would be to use `ix`.  However in this case a view is returned, which means that changing `df1` also changes the original DataFrame.

In [68]:
df = makegridDF()
# df1 = df.ix[:,slice('A','B')] # this and the following have same effect.
df1 = df.ix[:,0:2]  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
     A  B  C   D
0    1  4  7  10
1    2  5  8  11
2  100  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


To force a copy of the original, use the `copy` method.

In [69]:
df = makegridDF()
df1 = df.ix[:,slice('A','B')].copy() # this and the following have same effect.
# df1 = df.ix[:,0:2].copy()  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
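
Since `.ix` is no longer available in recent pandas releases, the same pattern can be sketched with `.loc` and an explicit `copy()`. The grid frame is again an assumed stand-in for `makegridDF()`:

```python
import pandas as pd

df_grid = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                        'C': [7, 8, 9], 'D': [10, 11, 12]})

# Label slice 'A':'B' includes both endpoints; copy() detaches the result.
sub = df_grid.loc[:, 'A':'B'].copy()
sub.loc[2, 'A'] = 100

print(sub)       # the copy holds the new value
print(df_grid)   # the original is unchanged
```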


###Adding a column to a dataframe

It is relatively easy to add a column to an existing data frame. 

In [70]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01 -2.000762 -0.693116 -0.791277 -1.189644 -3.485155
2013-01-02  1.289905  1.290522  0.151470 -1.330298  2.731897
2013-01-03  0.339837 -0.137046  1.456081 -1.067535  1.658873
2013-01-04 -0.522663 -1.011610  0.663339  0.955362 -0.870934
2013-01-05  1.501320 -0.251211  0.579976 -0.131052  1.830085
2013-01-06  0.731628 -0.686850  0.171400  2.036577  0.216178


String concatenation can be used across columns.

In [71]:
df = makegAlphaDF()
print(df)
df2 = df['A'] + df['B']
print(df2)

    A   B   C   D
0  0a  0b  0c  0d
1  1a  1b  1c  1d
2  2a  2b  2c  2d
0    0a0b
1    1a1b
2    2a2b
dtype: object


###Delete rows based on column value

DataFrame has an `isin` method. When calling `isin`, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

<http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin>  
<http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing>  

In the example below delete all rows where the value in column 'A' is in a given list.

In [72]:
df = makegridDF()
print(df)
idx = df['A'].isin([1,3])
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11


To match certain values in certain columns, make a dict where each key is a column name and each value is a list of the items you want to check for.  Combine DataFrame's `isin()` with the `any()` and `all()` methods to quickly select subsets of your data that meet a given criterion, for example to select the rows where each column meets its own criterion.

In the first example, remove all rows where the requirements for __all__ of the tests are met ('A' has 1 or 2, 'B' has 5 or 6, 'C' has 8 and 'D' has 11).

<http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.all.html#numpy.ndarray.all>

In [73]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6], 'C': [8], 'D': [11]})
print(idx)
idx = idx.all(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True   True   True
2  False   True  False  False
0    False
1     True
2    False
dtype: bool
   A  B  C   D
0  1  4  7  10
2  3  6  9  12


In the second example, drop all rows where the requirements for __any__ of the tests are met ('A' has 1 or 2, or 'B' has 5 or 6).  In this case every row matches at least one test, so all rows are dropped from the DataFrame, leaving it empty.

In [74]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6]})
print(idx)
idx = idx.any(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True  False  False
2  False   True  False  False
0    True
1    True
2    True
dtype: bool
Empty DataFrame
Columns: [A, B, C, D]
Index: []
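
One detail worth noting from the boolean frames printed above: columns that are not keys in the dict come back as False from `isin`, so `all(axis=1)` over the full frame cannot be True unless every column is listed. A hedged sketch of one way around this, restricting the test to the listed columns (the `criteria` name is only for illustration, and the frame is an assumed stand-in for `makegridDF()`):

```python
import pandas as pd

df_grid = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                        'C': [7, 8, 9], 'D': [10, 11, 12]})

criteria = {'A': [1, 2], 'B': [5, 6]}

# Test only the columns named in the dict, then require all of them to match.
mask = df_grid[list(criteria)].isin(criteria).all(axis=1)
kept = df_grid[~mask]   # drop the rows where every listed column matched
print(kept)
```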


###Sorting

Sort by column index (axis=1), i.e., rearrange column order.

In [75]:
df = makeDateRand()
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-0.365891,1.1737,-1.914599,0.083901
2013-01-02,-0.627845,-2.349284,0.17996,-0.814662
2013-01-03,0.063058,-0.31969,-0.320729,-1.215805
2013-01-04,-0.518339,-0.245213,1.607681,-0.564651
2013-01-05,-0.161537,-0.847865,1.472124,1.263288
2013-01-06,-0.773052,-0.485287,-1.38891,-0.483969


Sort all rows by row value in column B.

In [76]:
df.sort(columns='B')

Unnamed: 0,A,B,C,D
2013-01-01,0.083901,-1.914599,1.1737,-0.365891
2013-01-06,-0.483969,-1.38891,-0.485287,-0.773052
2013-01-03,-1.215805,-0.320729,-0.31969,0.063058
2013-01-02,-0.814662,0.17996,-2.349284,-0.627845
2013-01-05,1.263288,1.472124,-0.847865,-0.161537
2013-01-04,-0.564651,1.607681,-0.245213,-0.518339


Sort all rows by row value in multiple columns.

In [77]:
df.sort(columns=['B', 'C'])

Unnamed: 0,A,B,C,D
2013-01-01,0.083901,-1.914599,1.1737,-0.365891
2013-01-06,-0.483969,-1.38891,-0.485287,-0.773052
2013-01-03,-1.215805,-0.320729,-0.31969,0.063058
2013-01-02,-0.814662,0.17996,-2.349284,-0.627845
2013-01-05,1.263288,1.472124,-0.847865,-0.161537
2013-01-04,-0.564651,1.607681,-0.245213,-0.518339
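
In later pandas releases `DataFrame.sort(columns=...)` was replaced by `sort_values(by=...)`. A small sketch with made-up numbers, showing that the second column only breaks ties in the first:

```python
import pandas as pd

df_demo = pd.DataFrame({'B': [2, 1, 2, 1], 'C': [0.5, 0.9, 0.1, 0.3]})

# Sort by 'B' first; rows with equal 'B' are ordered by 'C'.
by_b_then_c = df_demo.sort_values(by=['B', 'C'])
print(by_b_then_c)
```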


You can introduce custom sorting by using categoricals.  In this example, first sort the 'G' column on default sorting (alphabetical).  Then redefine the 'G' column as a categorical with a specific sort order. Then re-sort, using the categorical sort order.

In [78]:
df = makecatedf()
print(df)
print(df.sort(columns='G'))
      
gsorter = ['fooz','fooa','foox','foon']
df.G = df.G.astype("category")
df.G.cat.set_categories(gsorter, inplace=True) 
print(df.sort(columns='G'))


   A          B  C  D      E    F     G
0  1 2013-01-02  1  3   test  foo  foox
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
0  1 2013-01-02  1  3   test  foo  foox
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
3  1 2013-01-02  1  3  train  foo  fooz
1  1 2013-01-02  1  3  train  foo  fooa
0  1 2013-01-02  1  3   test  foo  foox
2  1 2013-01-02  1  3   test  foo  foon
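
The `inplace` form of `set_categories` is not available in recent pandas releases. An equivalent sketch, assuming only the 'G' column of `makecatedf()` matters here, builds an ordered categorical directly:

```python
import pandas as pd

df_cat = pd.DataFrame({'G': ['foox', 'fooa', 'foon', 'fooz']})
gsorter = ['fooz', 'fooa', 'foox', 'foon']

# An ordered categorical sorts by position in `gsorter`, not alphabetically.
df_cat['G'] = pd.Categorical(df_cat['G'], categories=gsorter, ordered=True)
custom = df_cat.sort_values(by='G')
print(custom)
```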


###Slicing and selecting sub-arrays

While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, use the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix. 

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)

The [pandas site](http://pandas.pydata.org/pandas-docs/stable/indexing.html) offers the following description:

Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.

1.    `.ix` supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. `.ix` is the most general and will support any of the inputs in `.loc` and `.iloc`. `.ix` also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.      However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use `.iloc` or `.loc`.    
       `.ix` does not and cannot guarantee that the label versus integer position resolution is perfect - you may run into [problems](https://github.com/pydata/pandas/issues/6683) here.  `.ix` is an older method than `.loc` and `.iloc`, which were introduced specifically to prevent this ambiguity by using stricter rules on data selection. `.ix` is faster than `.loc` and `.iloc`.

     See more at [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced) and [Advanced Hierarchical](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-advanced-hierarchical).

1.    `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:
       * A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
       * A list or array of labels ['a', 'b', 'c']
       * A slice object with labels 'a':'f', (note that contrary to usual python slices, **both** the start and the stop are included!)
       * A boolean array

     See more at [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label)

1.     `.iloc` is primarily integer position based (from `0` to `length-1` of the axis), but may also be used with a boolean array. `.iloc`  will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:
       * An integer e.g. 5
       * A list or array of integers [4, 3, 0]
       * A slice object with ints 1:7
       * A boolean array

     See more at [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)


Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but applies to `.iloc` and `.ix` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`. (e.g. `p.loc['a']` is equiv to `p.loc['a', :, :]`)

|Object Type |	Indexers|
|--|--|
|Series 	|`s.loc[indexer]`|
|DataFrame 	|`df.loc[row_indexer,column_indexer]`|
|Panel 	|`p.loc[item_indexer,major_indexer,minor_indexer]`|
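
The differences between the label-based and position-based rules quoted above can be checked with a small sketch (the frame and its labels are made up for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df_demo.loc['a':'c']    # label slice: both endpoints included
by_pos = df_demo.iloc[0:2]         # position slice: upper bound excluded
past_end = df_demo.iloc[2:10]      # out-of-bounds slices are allowed

try:
    df_demo.loc['z']               # missing label
    missing_label = None
except KeyError as err:
    missing_label = type(err).__name__

try:
    df_demo.iloc[10]               # out-of-bounds single position
    bad_position = None
except IndexError as err:
    bad_position = type(err).__name__

print(len(by_label), len(by_pos), len(past_end), missing_label, bad_position)
```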


<http://nbviewer.ipython.org/github/gboeing/python-cheat-sheets/blob/master/pandas-selecting.ipynb>  


###Conventional selection by column/index name

Selecting a single column with the form `df['A']` yields a Series, equivalent to `df.A`.  
To select multiple columns, pass a list of column names as in `df[['A','B']]`.

In [79]:
df = makegridDF()
print(df.A)
print(df['A'])
print(df[['A','B']])

0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Extract the Numpy array from the series in one of these two ways:

In [80]:
print(np.asarray(df['A']))
print(df['A'].values)

[1 2 3]
[1 2 3]


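Slice rows using `df[]`, with either row numbers or index values.  The row sequence can use slice notation. Note that when slicing by row number the upper bound is not included, whereas when slicing by index label (as in the date example further down) both endpoints are included.  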

This is the same form as used for columns above - somewhat confusing!

In [81]:
df = makeDateRand()
print(df)
print(df[2:4])

                   A         B         C         D
2013-01-01 -0.049537 -0.442201 -1.970254  0.862714
2013-01-02  0.087023 -0.492432  0.248465  0.600162
2013-01-03 -0.723006 -0.336371  1.675453  0.717462
2013-01-04  0.292548  0.222891 -1.081402  0.414287
2013-01-05 -0.756815 -1.527111  0.813545  0.491031
2013-01-06 -0.792586 -1.027306 -0.158722  0.560602
                   A         B         C         D
2013-01-03 -0.723006 -0.336371  1.675453  0.717462
2013-01-04  0.292548  0.222891 -1.081402  0.414287


In [82]:
print(df['2013-01-01':'2013-01-02'])

                   A         B         C         D
2013-01-01 -0.049537 -0.442201 -1.970254  0.862714
2013-01-02  0.087023 -0.492432  0.248465  0.600162


###[`.ix` Conventional selection by label or position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

You can use `ix` to select slices of the data frame.  

In [83]:
df = makeDateRand()
print(df)
print('')
print(df.ix[:, 'D']) # All rows in column D
print(df.ix[0:2, 0:2]) # upper left 2x2 sub-array, not including third [2] column 
print(df.ix[0:2, [0,2,3]]) # multiple columns in list format
print(df.ix[1:3, 'A':'C']) # use range of column names, same effect as above, note 'C' included!!
print(df.ix[2:4, ['A','C']]) # use list of column names
print(df.ix[1:3, 'B':]) # All columns onwards from 'B'
print(df.ix[1:3, :'C']) # All columns up to and including!! C
df.ix[1:3, :'C'] = -1
print(df)

                   A         B         C         D
2013-01-01  0.057676  0.558993 -1.591971 -0.690108
2013-01-02 -1.179716  1.187584 -0.093978  0.870272
2013-01-03  0.290115  1.238796  1.044845  0.307820
2013-01-04  0.356530 -0.506842 -1.480214  1.201315
2013-01-05 -0.878509  2.323929  0.496367 -0.539124
2013-01-06 -0.772658 -0.261086  1.875940  0.884133

2013-01-01   -0.690108
2013-01-02    0.870272
2013-01-03    0.307820
2013-01-04    1.201315
2013-01-05   -0.539124
2013-01-06    0.884133
Freq: D, Name: D, dtype: float64
                   A         B
2013-01-01  0.057676  0.558993
2013-01-02 -1.179716  1.187584
                   A         C         D
2013-01-01  0.057676 -1.591971 -0.690108
2013-01-02 -1.179716 -0.093978  0.870272
                   A         B         C
2013-01-02 -1.179716  1.187584 -0.093978
2013-01-03  0.290115  1.238796  1.044845
                   A         C
2013-01-03  0.290115  1.044845
2013-01-04  0.356530 -1.480214
                   B         C         

Copying discontinuous column ranges takes a bit more effort. First create a list of the required columns:

In [84]:
df = makeDateRand()
lst = list(df.columns[0:1]) + list(df.columns[2:3])
print(lst)
df1 = df[lst].copy() # copy was made, use this to get rid of the warning 
df1.ix[2,0] = +1000
print(df1)

df2 = df.ix[:,lst] # ix appears to have made a copy
df2.ix[2,0] = +1000
print(df2)
print(df)



['A', 'C']
                      A         C
2013-01-01    -1.443114 -0.206099
2013-01-02    -0.369420 -1.494783
2013-01-03  1000.000000 -0.027470
2013-01-04    -1.143143 -1.681478
2013-01-05    -0.492624  0.986621
2013-01-06     1.702025  3.081244
                      A         C
2013-01-01    -1.443114 -0.206099
2013-01-02    -0.369420 -1.494783
2013-01-03  1000.000000 -0.027470
2013-01-04    -1.143143 -1.681478
2013-01-05    -0.492624  0.986621
2013-01-06     1.702025  3.081244
                   A         B         C         D
2013-01-01 -1.443114 -1.808870 -0.206099 -0.942299
2013-01-02 -0.369420  0.895848 -1.494783 -0.282347
2013-01-03 -0.104783  0.800126 -0.027470  0.539052
2013-01-04 -1.143143  2.292103 -1.681478 -1.584916
2013-01-05 -0.492624 -0.630026  0.986621 -0.603984
2013-01-06  1.702025 -0.653170  3.081244  0.254265
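
In pandas releases without `.ix`, mixing positional rows with labelled columns needs an explicit translation step. A sketch of two options, using an assumed stand-in for `makeDateRand()` (six dates, columns A to D, random values):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df_rand = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Option 1: turn row positions into labels, then use .loc throughout.
via_loc = df_rand.loc[df_rand.index[1:3], ['A', 'C']]

# Option 2: turn column labels into positions, then use .iloc throughout.
col_pos = df_rand.columns.get_indexer(['A', 'C'])
via_iloc = df_rand.iloc[1:3, col_pos]

print(via_loc)
print(via_iloc)
```

Both selections pick the same two rows and two columns; which form reads better depends on whether the surrounding code thinks in labels or in positions.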


###[`.loc` Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

The following example is strange in the sense that it refers to the index by name (see the function where the DataFrame was created), but the index is not named.  Yet it can, and must, be used by way of the Series name `dates`. This is probably because pandas has strong support for time series.

In [1]:
df = makeDateRand()
print(df)
print(df.index)
print(df.index.name)
print(df.columns)
print(df.loc[dates[0]])

NameError: name 'makeDateRand' is not defined

In this example the row with index label 0 is accessed by that label alone.  The index is not named.

In [None]:
df = makegridDF()
print(df)
print(df.index)
print(df.index.name)
print(df.loc[0])

Select all the rows, but only the 'A' and 'B' columns of these rows.

In [None]:
df.loc[:,['A','B']]

In the following example a slice is made on both rows and columns.  Note that when using `loc` both endpoints in the row range are returned, but in the `ix` case the upper bound must point to one beyond the end row.

In [None]:
df = makeDateRand()
print(df.loc['20130102':'20130104',['A','B']])
print(df.ix[1:4,['A','B']])
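
The endpoint behaviour described above can be verified with a self-contained sketch; `.iloc` is used for the positional case, and the frame is an assumed stand-in for `makeDateRand()`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df_rand = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Label slice: 2013-01-02 through 2013-01-04, both endpoints included.
by_label = df_rand.loc['20130102':'20130104', ['A', 'B']]

# Position slice: rows 1, 2 and 3; the upper bound 4 is excluded.
by_pos = df_rand.iloc[1:4][['A', 'B']]

print(by_label)
print(by_pos)
```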

This example selects a row by using a dynamically generated datetime value.

In [None]:
df = makeDateRand()
df.ix[datetime.datetime(2013,1,2)]

Rows can also be selected by numeric index by using the `irow` method.

In [None]:
df = makeDateRand()
print(df)
print(df.irow(1))
print(df.irow(3))

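This example iterates over all rows and tries three ways of assigning values during iteration. Only the direct assignment through the indexer on `df` (column 'B') is reliable: `row` may be a copy of the data, and chained indexing such as `df.ix[idx]['C']` may also write to a temporary copy.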

In [None]:
df = makeDateRand()
print(df.head())
for i,(idx, row) in enumerate(df.iterrows()):
    row['A'] = 2
    df.ix[idx, 'B'] = i
    df.ix[idx]['C'] = np.sqrt(i)
print(df.head())
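
A hedged sketch of the safer pattern: write through an indexer on the frame itself (here `.at`, with row label and column name) rather than through the `row` object. The two-column frame is made up for illustration:

```python
import pandas as pd

# Mixed dtypes, so each `row` from iterrows() is built from a converted copy.
df_demo = pd.DataFrame({'A': [1, 1, 1], 'B': [0.0, 0.0, 0.0]})

for i, (idx, row) in enumerate(df_demo.iterrows()):
    row['A'] = 2              # changes the temporary Series only
    df_demo.at[idx, 'B'] = i  # writes into the frame itself

print(df_demo)
```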


###[`.iloc` Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

The `iloc` indexer selects rows and columns by integer position.

In [87]:
df = makeDateRand()
df.iloc[3]

A   -0.076820
B   -0.363033
C   -0.816235
D    1.931114
Name: 2013-01-04 00:00:00, dtype: float64

In [88]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,-0.07682,-0.363033
2013-01-05,-0.453421,-1.748829


In [89]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,0.500771,-0.579059
2013-01-03,1.781403,1.678803
2013-01-05,-0.453421,0.999213


In [90]:
#slicing rows
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,0.500771,-0.648795,-0.579059,1.478035
2013-01-03,1.781403,0.32885,1.678803,1.058212


In [91]:
#slicing columns
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,0.13547,1.985935
2013-01-02,-0.648795,-0.579059
2013-01-03,0.32885,1.678803
2013-01-04,-0.363033,-0.816235
2013-01-05,-1.748829,0.999213
2013-01-06,1.952574,-1.615337


In [92]:
df.iloc[1,1]

-0.64879466137027653

###Series/DataFrame [enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement)

The `.loc`/`.ix`/`[]` operations can perform enlargement when setting a non-existent key for that axis. In the Series case this is effectively an appending operation.

In [93]:
se = pd.Series([1,2,3])
print(se) 
se[5] = 5.
print(se)

0    1
1    2
2    3
dtype: int64
0    1
1    2
2    3
5    5
dtype: float64


A DataFrame can be enlarged on either axis via `.loc`

In [94]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
print(dfi)
dfi.loc[:,'C'] = dfi.loc[:,'A']
print(dfi)
dfi.loc[3] = 5
print(dfi)

   A  B
0  0  1
1  2  3
2  4  5
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5


###Find row where index is nearest to given value

In [95]:
df = makeDateRand()
print(df)
print(df.iloc[np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))]) # row
print(np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))) # index

                   A         B         C         D
2013-01-01  0.189437  1.968306 -0.522984 -0.666837
2013-01-02 -0.242677  0.100030  2.273329 -0.514808
2013-01-03  0.247026  1.102906 -1.309020 -0.260669
2013-01-04  0.203573 -1.968246  0.554815  0.056351
2013-01-05 -0.258134 -0.924069  1.358362  0.185336
2013-01-06 -0.511529 -1.546994 -1.382004 -0.510906
A    0.203573
B   -1.968246
C    0.554815
D    0.056351
Name: 2013-01-04 00:00:00, dtype: float64
3


In [96]:
df = makeRand()
print(df)
row = df.iloc[np.argmin(np.abs(df.index - 2))] # row
print(type(row))
print(row)
print(np.argmin(np.abs(df.index - 2))) # index

          A         B         C         D
0 -0.119847  0.025491  1.390878 -2.482371
1 -0.325850 -1.306201  0.019596  0.673565
2  0.546756 -0.447161  0.341329  0.353256
3  0.822402 -1.071836 -1.742314 -0.559044
<class 'pandas.core.series.Series'>
A    0.546756
B   -0.447161
C    0.341329
D    0.353256
Name: 2, dtype: float64
2
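
For a sorted index, pandas can also do the nearest-match lookup itself through `Index.get_indexer` with `method='nearest'`. A sketch under the assumption of a daily date index like the one above (the lookup time is arbitrary):

```python
import pandas as pd

idx = pd.date_range('20130101', periods=6)
target = pd.Timestamp('2013-01-04 10:00')

# get_indexer returns an array of positions, one per requested value.
pos = idx.get_indexer([target], method='nearest')[0]
print(pos, idx[pos])
```

The returned position can be passed straight to `df.iloc[pos]` to fetch the row.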


###Find row where column is maximum

In [97]:
df = makeRand()
print(df)
print(df['A'].idxmax())  # index label of the row where 'A' is maximum
print(df.loc[df['A'].idxmax()]) # row

          A         B         C         D
0  0.914431 -0.876124 -0.424277 -0.462927
1 -0.762158 -0.071080  0.406489 -1.499572
2 -1.903506  0.222047 -0.624086  0.480266
3  0.050370  0.063461 -1.321908 -0.974930
0
A    0.914431
B   -0.876124
C   -0.424277
D   -0.462927
Name: 0, dtype: float64
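
`idxmax` returns the index label (not the position) of the maximum, so it pairs naturally with `.loc`. A sketch with made-up values and a non-default index to make the distinction visible:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [0.9, -0.7, 1.5], 'B': [1, 2, 3]},
                       index=['r0', 'r1', 'r2'])

label = df_demo['A'].idxmax()   # label of the row where 'A' is largest
row = df_demo.loc[label]        # look the row up by that label
print(label)
print(row)
```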


###Find row where specific column has nearest value

In [98]:
df = makeRand()
print(df)
value = 0
print(df.iloc[np.argmin(np.abs(df['A'] - value))]) # row
print(np.argmin(np.abs(df['A'] - value))) # index

          A         B         C         D
0 -0.856433  0.891878 -0.528332 -1.112698
1  0.015759  0.037903  0.003651  0.051812
2  1.743219  0.835476  0.964045  0.468997
3 -0.098828  0.672620  1.415898 -0.128546
A    0.015759
B    0.037903
C    0.003651
D    0.051812
Name: 1, dtype: float64
1
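
The same lookup can be sketched in pure pandas with `abs()` and `idxmin()`, which returns the index label of the closest row. The values below are made up for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [-0.86, 0.02, 1.74, -0.10]})
value = 0

# Absolute distance to `value`, then the label of the smallest distance.
nearest_label = (df_demo['A'] - value).abs().idxmin()
print(nearest_label, df_demo.loc[nearest_label, 'A'])
```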


###Boolean indexing and filtering

In [99]:
#filter rows by a condition on a single column
df = makeDateRand()
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-02,0.080407,-0.690036,-0.708607,0.925353
2013-01-03,0.760107,-0.761684,-0.613358,-0.245375
2013-01-04,2.21878,-0.008792,-1.279419,2.019393
2013-01-05,0.015791,-1.157923,-1.439321,-0.72175
2013-01-06,1.78483,1.164756,0.295163,0.414056


In [100]:
#filter rows by conditions on multiple columns
df2 = df[(df.A>0) & (df.B>0)]
df2

Unnamed: 0,A,B,C,D
2013-01-06,1.78483,1.164756,0.295163,0.414056


In [101]:
#filter specs are boolean pandas Series, which can be manipulated
filt = (df.A>0) & (df.B>0)
print(type(filt), filt)
print('filt.any() = {}'.format(filt.any()))
print('filt.all() = {}'.format(filt.all()))

(<class 'pandas.core.series.Series'>, 2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06     True
Freq: D, dtype: bool)
filt.any() = True
filt.all() = False


In [102]:
#filter by element
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,,0.067313,
2013-01-02,0.080407,,,0.925353
2013-01-03,0.760107,,,
2013-01-04,2.21878,,,2.019393
2013-01-05,0.015791,,,
2013-01-06,1.78483,1.164756,0.295163,0.414056


In [103]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01 -0.243915 -0.543940  0.067313 -0.898526    one
2013-01-02  0.080407 -0.690036 -0.708607  0.925353    one
2013-01-03  0.760107 -0.761684 -0.613358 -0.245375    two
2013-01-04  2.218780 -0.008792 -1.279419  2.019393  three
2013-01-05  0.015791 -1.157923 -1.439321 -0.721750   four
2013-01-06  1.784830  1.164756  0.295163  0.414056  three


Unnamed: 0,A,B,C,D,E
2013-01-03,0.760107,-0.761684,-0.613358,-0.245375,two
2013-01-05,0.015791,-1.157923,-1.439321,-0.72175,four


In [104]:
#get unique values in a column
df = makefoobar()
print(df)
df.B.unique()

     A      B         C         D
0  foo    one  0.052574 -0.613090
1  bar    one  1.027308  0.627374
2  foo    two  0.429571 -0.118539
3  bar  three -1.453048 -0.935133
4  foo    two -2.016775  0.081391
5  bar    two  1.268818 -0.327257
6  foo    one -0.864569  0.847683
7  foo  three  1.893796  0.039157


array(['one', 'two', 'three'], dtype=object)

<http://stackoverflow.com/questions/20875140/apply-function-to-sets-of-columns-in-pandas-looping-over-entire-data-frame-co>  

The question linked above asks how to calculate, for each row, the length of the vector under each header (A and B in this case) and divide it by the Time column. Hence the function needs to be np.sqrt(A_x^2 + A_y^2 + A_z^2)/Time, and the same for B. In other words, calculate the velocity for each row, where three columns contribute to one velocity result.

In [105]:
#pandas approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

#filter the column names to get a list of the ones you need
print(filter(lambda x: x.startswith("A_"),df.columns))

#get the columns according to names
print(df[filter(lambda x: x.startswith("A_"),df.columns)])

# do the apply dot product for each row across columns
column_initials = ["A","B"]
for column_initial in column_initials:
    df["Velocity_"+column_initial] = \
    df[filter(lambda x: x.startswith(column_initial+"_"),df.columns)].apply(lambda x: np.sqrt(x.dot(x)), axis=1)/df.Time
print(df)  


['A_x', 'A_y', 'A_z']
         A_x       A_y       A_z
1   1.274422 -0.074900 -1.045540
2   0.601166 -1.538175  0.138731
3  -0.590328  0.523427  0.908862
4   0.039466 -1.053585  0.576629
5   1.136747  2.863533  1.000475
6   1.341582  0.548275 -0.609797
7   0.328577  0.780283 -0.415993
8  -1.107100  1.087305 -1.297485
9  -0.554681  1.485139 -0.502074
10 -1.150768 -0.739333  2.356498
        Time       A_x       A_y       A_z       B_x       B_y       B_z  Velocity_A  Velocity_B
1  -0.824695  1.274422 -0.074900 -1.045540  0.279494 -0.660912  1.736886   -2.000892   -2.278759
2  -0.233684  0.601166 -1.538175  0.138731  1.636022 -2.410247  1.293684   -7.092031  -13.639748
3   1.564057 -0.590328  0.523427  0.908862 -0.293161  0.327054 -0.244306    0.769495    0.321335
4   0.903773  0.039466 -1.053585  0.576629  1.763697 -0.106925 -0.390330    1.329656    2.002201
5  -0.489718  1.136747  2.863533  1.000475  0.420421 -1.065808  0.482030   -6.614597   -2.538201
6   0.500923  1.341582  0.548275 

In [106]:
#numpy approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)

   Velocity_A  Velocity_B
0    2.563454    3.758577
1    0.223032    1.404826
2    2.703913    1.694411
3    3.188365    2.945370
4   -0.160827   -0.726197
5   -2.193410   -1.100944
6   -1.531621   -1.712556
7    2.708106    2.057796
8    3.172638    4.060869
9    3.105611    1.820739


In [107]:
# yet another approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

result = df\
    .loc[:, df.columns!='Time']\
    .groupby(lambda x: x[0], axis=1)\
    .apply(lambda x: np.sqrt((x**2).sum(1)))\
    .apply(lambda x: x / df['Time'])

print(result)

            A          B
1   -0.649460  -0.799907
2  -17.659506 -13.504172
3    4.263316  14.890226
4    0.811496   1.529205
5   -2.424726  -3.881509
6   -2.249967  -2.542003
7   -3.141124  -1.482995
8    5.004939   3.854648
9   -2.721320  -2.436191
10  -1.165735  -1.842153
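
The same per-row vector length can also be written with `np.linalg.norm`. A sketch with hand-picked numbers (a 3-4-0 and a 0-6-8 vector) so the result is easy to check; the column naming follows the cells above:

```python
import numpy as np
import pandas as pd

df_vec = pd.DataFrame({'Time': [5.0, 2.0],
                       'A_x': [3.0, 0.0], 'A_y': [4.0, 6.0], 'A_z': [0.0, 8.0]})

# Euclidean length of (A_x, A_y, A_z) per row, divided by the elapsed time.
length_a = np.linalg.norm(df_vec[['A_x', 'A_y', 'A_z']].values, axis=1)
df_vec['Velocity_A'] = length_a / df_vec['Time']
print(df_vec)
```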


###Setting data

In [108]:
#Adding the sum along a column
df = makeDateRand()
df['A'].sum(), df['B'].sum(), df['C'].sum(), 

(0.5179013623765184, 3.8148887812136141, 0.8382415036510682)

In [109]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01 -0.328146  0.029867  0.485955 -0.280282  0.187677
2013-01-02 -0.020318  0.340067  1.112996 -0.195323  1.432745
2013-01-03 -0.834466  0.280605  1.106838  0.767093  0.552978
2013-01-04 -1.874333  0.817828 -0.671192 -1.042477 -1.727697
2013-01-05  1.367361  1.647769  0.982360  2.303161  3.997490
2013-01-06  1.903007  2.175426  1.417710  0.317822  5.496143


In [110]:
sum_row = df[['A','B','Total']].sum()
sum_row

A        0.213105
B        5.291564
Total    9.939336
dtype: float64

We need to transpose the data and convert the Series to a DataFrame so that it is easier to concat onto our existing data. The `T` attribute (transpose) allows us to switch the data from being row-based to column-based.



In [111]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Unnamed: 0,A,B,Total
0,0.213105,5.291564,9.939336


The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.


In [112]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,A,B,C,D,Total
0,0.213105,5.291564,,,9.939336


Now append the totals to the end of the dataframe, rename the index value to use the word 'Total'.

In [113]:
df=df.append(df_sum,ignore_index=True)
df.index = df.index.tolist()[:-1]   + ['Total']
df.tail()

Unnamed: 0,A,B,C,D,Total
2,-0.834466,0.280605,1.106838,0.767093,0.552978
3,-1.874333,0.817828,-0.671192,-1.042477,-1.727697
4,1.367361,1.647769,0.98236,2.303161,3.99749
5,1.903007,2.175426,1.41771,0.317822,5.496143
Total,0.213105,5.291564,,,9.939336
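
`DataFrame.append` has been removed from recent pandas releases; `pd.concat` gives the same result. A sketch of the whole totals-row recipe on a tiny made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [5.0, 6.0]})

# Column sums as a one-row frame, padded with NaN for the unsummed column.
totals = df_demo[['A', 'B']].sum().to_frame().T.reindex(columns=df_demo.columns)
totals.index = ['Total']

with_total = pd.concat([df_demo, totals])
print(with_total)
```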


Setting a new column automatically aligns the data by the indexes

In [114]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,-0.243915,-0.54394,0.067313,-0.898526,one,
2013-01-02,0.080407,-0.690036,-0.708607,0.925353,one,1.0
2013-01-03,0.760107,-0.761684,-0.613358,-0.245375,two,2.0
2013-01-04,2.21878,-0.008792,-1.279419,2.019393,three,3.0
2013-01-05,0.015791,-1.157923,-1.439321,-0.72175,four,4.0
2013-01-06,1.78483,1.164756,0.295163,0.414056,three,5.0


In [115]:
# Setting values by label
df.at[dates[0],'A'] = 0
df

NameError: name 'dates' is not defined

In [116]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D,Total
0,-0.328146,7.0,0.485955,-0.280282,0.187677
1,-0.020318,0.340067,1.112996,-0.195323,1.432745
2,-0.834466,0.280605,1.106838,0.767093,0.552978
3,-1.874333,0.817828,-0.671192,-1.042477,-1.727697
4,1.367361,1.647769,0.98236,2.303161,3.99749
5,1.903007,2.175426,1.41771,0.317822,5.496143
Total,0.213105,5.291564,,,9.939336


In [117]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,Total
0,-0.328146,7.0,0.485955,5,0.187677
1,-0.020318,0.340067,1.112996,5,1.432745
2,-0.834466,0.280605,1.106838,5,0.552978
3,-1.874333,0.817828,-0.671192,5,-1.727697
4,1.367361,1.647769,0.98236,5,3.99749
5,1.903007,2.175426,1.41771,5,5.496143
Total,0.213105,5.291564,,5,9.939336


In [118]:
# A where operation with setting.
df = makeDateRand()
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,-0.144325,-2.334306,-0.122322,-1.329312
2013-01-02,-0.769916,-1.145912,-0.147902,-0.371129
2013-01-03,-1.013204,-0.827255,-0.161043,-0.411718
2013-01-04,-0.014626,-0.163666,-0.804304,-2.244476
2013-01-05,-0.508525,-0.03954,-0.104592,-0.913627
2013-01-06,-0.007214,-1.091979,-1.140243,-0.412641
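
The same result can be expressed without in-place setting by using `DataFrame.where`, which keeps values where the condition holds and substitutes the second argument elsewhere. A sketch on made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.5, -2.0], 'B': [-0.5, 3.0]})

# Keep entries that are already <= 0; replace the rest by their negation.
all_negative = df_demo.where(df_demo <= 0, -df_demo)
print(all_negative)
```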


##Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data)

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [119]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

NameError: name 'dates' is not defined

Drop any rows that have missing data:

In [120]:
df1.dropna(how='any')

Unnamed: 0,A,C
2013-01-01,-1.443114,-0.206099
2013-01-02,-0.36942,-1.494783
2013-01-03,1000.0,-0.02747
2013-01-04,-1.143143,-1.681478
2013-01-05,-0.492624,0.986621
2013-01-06,1.702025,3.081244


Filling missing data

In [121]:
df1.fillna(value=5)

Unnamed: 0,A,C
2013-01-01,-1.443114,-0.206099
2013-01-02,-0.36942,-1.494783
2013-01-03,1000.0,-0.02747
2013-01-04,-1.143143,-1.681478
2013-01-05,-0.492624,0.986621
2013-01-06,1.702025,3.081244


To get the boolean mask where values are nan

In [122]:
pd.isnull(df1)

Unnamed: 0,A,C
2013-01-01,False,False
2013-01-02,False,False
2013-01-03,False,False
2013-01-04,False,False
2013-01-05,False,False
2013-01-06,False,False
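A self-contained sketch of the same steps may make the effect of each call easier to check. The small random frame below is only a stand-in for `makeDateRand()` (an assumption, since that helper is defined elsewhere in the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's makeDateRand() helper.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Reindexing with a new column 'E' introduces NaN where no value exists yet.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1

dropped = df1.dropna(how="any")  # keeps only the two rows where E is set
filled = df1.fillna(value=5)     # NaN in E replaced by 5
mask = pd.isnull(df1)            # True where a value is missing
```

Here only the last two rows of column E are missing, so `dropped` has two rows and `mask` is True in exactly two cells.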


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [123]:
df.mean()

A   -0.235251
B   -0.155674
C    0.110436
D    0.199504
dtype: float64

In [124]:
#along the other axis
df.mean(1)

2013-01-01    0.245748
2013-01-02   -0.349199
2013-01-03   -0.316925
2013-01-04    0.322783
2013-01-05   -0.137308
2013-01-06    0.113422
Freq: D, dtype: float64
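The frames above contain no missing values, so the exclusion of NaN is not visible in the output. A minimal sketch with a hand-made frame (illustrative values only) shows the behaviour, and how `skipna=False` switches it off:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, np.nan, np.nan]})

col_means = df.mean()           # NaN is skipped: A uses two values, B uses one
row_means = df.mean(axis=1)     # the last row has no valid values, so its mean is NaN
strict = df.mean(skipna=False)  # any NaN in a column makes that column's mean NaN
```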

###Applying functions to the data

When using apply(), note the axis direction:   
- axis=0 (default): the function is applied to each column, working down the rows.
- axis=1: the function is applied to each row, working across the columns.

In [125]:
df = makegridDF()
print(df)
print(df.apply(np.cumsum))
print(df.apply(np.cumsum, axis=0))
print(df.apply(np.cumsum, axis=1))

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A  B   C   D
0  1  5  12  22
1  2  7  15  26
2  3  9  18  30


In [126]:
df = makeDateRand()
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,-0.278831,-0.297304,-0.754138,2.458566
2013-01-02,-0.91526,-0.199001,0.617264,2.723362
2013-01-03,-0.580445,1.749183,1.661473,2.557722
2013-01-04,-1.383781,-0.811856,2.381439,0.469024
2013-01-05,-2.320366,-0.941746,1.728593,1.202142
2013-01-06,-3.336905,-0.546717,1.494484,1.415916


In [127]:
df = makeDateRand()
print(df)
df.apply(lambda x: x.max() - x.min())

                   A         B         C         D
2013-01-01  1.849118 -0.222530  0.976040  0.006137
2013-01-02  0.519233  1.117980  0.213673 -0.962342
2013-01-03 -1.685890  0.041302  0.343944  2.227856
2013-01-04  1.682176  1.308816  0.970347  0.403449
2013-01-05  0.634276  1.097843  1.429936 -0.529867
2013-01-06  0.558784 -2.154713 -0.245882 -0.298425


A    3.535007
B    3.463529
C    1.675818
D    3.190197
dtype: float64

In [128]:
from datetime import datetime
df = makeDateRand()
df.index.name = 'Date'
df.reset_index(level=0,inplace=True)
print(df)
#convert to string format
df.Date = df.Date.apply(lambda d: ' '.join(d.isoformat().split('T')))
print(df)
#convert back to datetime format
df.Date = df.Date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S"))
print(df)

df.index = df.Date
print(df)

        Date         A         B         C         D
0 2013-01-01  2.343150  0.864919 -0.349607  1.303671
1 2013-01-02 -0.256150 -0.554323 -0.508537  0.966387
2 2013-01-03 -0.224787 -0.346977  0.234857 -0.044442
3 2013-01-04 -0.034938  0.765512 -0.399237  0.878525
4 2013-01-05  0.016009 -0.329236  1.260952 -1.511141
5 2013-01-06 -1.113445  0.667540 -0.993290 -0.548711
                  Date         A         B         C         D
0  2013-01-01 00:00:00  2.343150  0.864919 -0.349607  1.303671
1  2013-01-02 00:00:00 -0.256150 -0.554323 -0.508537  0.966387
2  2013-01-03 00:00:00 -0.224787 -0.346977  0.234857 -0.044442
3  2013-01-04 00:00:00 -0.034938  0.765512 -0.399237  0.878525
4  2013-01-05 00:00:00  0.016009 -0.329236  1.260952 -1.511141
5  2013-01-06 00:00:00 -1.113445  0.667540 -0.993290 -0.548711
        Date         A         B         C         D
0 2013-01-01  2.343150  0.864919 -0.349607  1.303671
1 2013-01-02 -0.256150 -0.554323 -0.508537  0.966387
2 2013-01-03 -0.224787 -0.346

Applying a function using column data, but with extra parameters.  In the example below we use the value in a single DataFrame column 'IrradianceLux', together with extra parameters, to calculate new columns, one per f-number.

<http://stackoverflow.com/questions/21188504/python-pandas-apply-a-function-with-arguments-to-a-series-update>

In [4]:
import pandas as pd
lx = {'Sunlight': 107527, 
      'Full daylight': 10752,
      'Overcast day':1075,
      'Very dark day':107,
      'Twilight': 10.8,
      'Deep twilight': 1.08,
      'Full moon': 0.108,
      'Quarter moon':0.0108,
      'Starlight': 0.0011,
      'Overcastnight':0.0001
    }
fnos = [1.4, 2, 2.74, 3.8, 5.4, 7.5
       ]

def calcIrrad(lx, rho, taua, tauo, fno):
    return lx * rho * taua * tauo / (4 * fno ** 2)
    
df = pd.DataFrame(list(lx.items()), columns=['Condition','IrradianceLux'])

rho = 0.3
taua = 0.5
tauo = 0.9
for fno in fnos:
    df['{}'.format(fno)] = df.IrradianceLux.apply(calcIrrad, args=(rho, taua, tauo, fno) )
    
df.sort('IrradianceLux')  # in pandas 0.17 and later: df.sort_values('IrradianceLux')

Unnamed: 0,Condition,IrradianceLux,1.4,2,2.74,3.8,5.4,7.5
8,Overcastnight,0.0001,2e-06,1e-06,0.0,0.0,0.0,6e-08
2,Starlight,0.0011,1.9e-05,9e-06,5e-06,3e-06,1e-06,6.6e-07
4,Quarter moon,0.0108,0.000186,9.1e-05,4.9e-05,2.5e-05,1.2e-05,6.48e-06
6,Full moon,0.108,0.00186,0.000911,0.000486,0.000252,0.000125,6.48e-05
3,Deep twilight,1.08,0.018597,0.009113,0.004855,0.002524,0.00125,0.000648
7,Twilight,10.8,0.185969,0.091125,0.048551,0.025242,0.0125,0.00648
1,Very dark day,107.0,1.842474,0.902813,0.481013,0.250087,0.123843,0.0642
0,Overcast day,1075.0,18.510842,9.070312,4.832603,2.512552,1.244213,0.645
5,Full daylight,10752.0,185.142857,90.72,48.335021,25.130194,12.444444,6.4512
9,Sunlight,107527.0,1851.549107,907.259063,483.381673,251.3183,124.452546,64.5162


Apply a function to existing columns to create a new column

In [3]:
def fx(x, y):
    return x*y

import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
df['nocols'] = np.vectorize(fx)(4, 5)
df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
df

Unnamed: 0,A,B,nocols,new_column
0,10,20,20,200
1,20,30,20,600
2,30,10,20,300


Apply a named function on multiple columns and scalar arguments

In [2]:
import pandas as pd 
data = {'gene':['a','b','c','d','e'],
        'count':[61,320,34,14,33],
        'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]

def calculate_RPKM(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df

Unnamed: 0,gene,count,gene_length,rpkm
0,a,61,152,32508.366908
1,b,320,86,301411.926493
2,c,34,92,29936.429112
3,d,14,170,6670.955138
4,e,33,111,24082.405613


Use apply to return multiple columns

<http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns>

In [2]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})), 
    left_index=True, right_index=True)

Unnamed: 0,textcol,feature1,feature2
0,0.82329,1.82329,-0.17671
1,0.481875,1.481875,-0.518125
2,0.783152,1.783152,-0.216848
3,0.407895,1.407895,-0.592105
4,0.483882,1.483882,-0.516118


Return multiple columns, operating on a single column, with additional arguments

In [5]:
def myfunc(s,a1, a2):
    return pd.Series({'feature1':s+a1, 'feature2':s+a2})
    
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df.merge(df.textcol.apply(myfunc,args=(+2, -5)), 
    left_index=True, right_index=True)    
    

Unnamed: 0,textcol,feature1,feature2
0,0.835452,2.835452,-4.164548
1,0.186309,2.186309,-4.813691
2,0.253024,2.253024,-4.746976
3,0.026933,2.026933,-4.973067
4,0.463743,2.463743,-4.536257
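As a possible alternative to `merge` with `left_index=True, right_index=True`, `DataFrame.join` aligns on the index by default. This is only a sketch of the same idea using the `myfunc` pattern from above, not a replacement for the original cell.

```python
import numpy as np
import pandas as pd

def myfunc(s, a1, a2):
    # Return two derived values for a single input value s.
    return pd.Series({"feature1": s + a1, "feature2": s + a2})

df = pd.DataFrame({"textcol": np.random.rand(5)})

# apply() returns a DataFrame because myfunc returns a Series per element.
features = df["textcol"].apply(myfunc, args=(2, -5))

# join() aligns on the index, so no left_index/right_index arguments are needed.
joined = df.join(features)
```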


The following example calculates the angle between two normal vectors, where the two vectors are given in separate tables. The common value whereby the two tables are joined is given in the 'Key' column in each of the tables.

In [72]:
from numpy import linalg as LA

def normCols(df, lst):
    """Normalise the columns in df, as listed in lst"""
    #get the vector length
    df['norm'] = (LA.norm(df[lst],axis=1))
    #normalise cols in list and return
    df[lst] = df[lst].divide(df['norm'], axis=0)
    df.drop('norm',axis=1,inplace=True)
    #df is modified in place, so the caller sees the normalised columns and nothing is returned
    return

#create the data: [key, x, y, z]
lstA = [['a',1,0,0],['b',1,0,0],['c', 1,0,0],['d',-4.164548,2.835452,0.835452],['e',-4.164548,2.835452,0.835452]]
lstB = [['a',1,0,0],['b',0,1,0],['c',-1,0,0],['d',-4.164548,2.835452,0.835452],['e',-3.164548,1.835452,2.835452]]

#make dataframes
vec = ['x','y','z']
cols = ['Key'] + vec
dfA = pd.DataFrame(lstA,columns=cols)
dfB = pd.DataFrame(lstB,columns=cols)

#normalise vectors
normCols(dfA, vec)
normCols(dfB, vec)

#join the two vectors on the Key column
dfA.reset_index(inplace=True)
dfB.reset_index(inplace=True)
suffixes = ['a','b'] # used in labelling duplicate column names
dfM = pd.merge(dfA, dfB, left_on=['Key'],right_on=['Key'], how='inner', suffixes=suffixes)  

#calc angle between vectors
va = ['xa','ya','za']
vb = ['xb','yb','zb']
# np.inner(Va, Vb) calculates Va . Vb.T; the diagonal() of that matrix holds
# the dot product of each row of Va with the matching row of Vb
dfM['angle'] = np.arccos(np.diagonal(np.inner(dfM[va],dfM[vb])))

print(dfM)

   indexa Key        xa        ya       za  indexb        xb        yb  \
0       0   a  1.000000  0.000000  0.00000       0  1.000000  0.000000   
1       1   b  1.000000  0.000000  0.00000       1  0.000000  1.000000   
2       2   c  1.000000  0.000000  0.00000       2 -1.000000  0.000000   
3       3   d -0.815462  0.555211  0.16359       3 -0.815462  0.555211   
4       4   e -0.815462  0.555211  0.16359       4 -0.683709  0.396554   

         zb         angle  
0  0.000000  0.000000e+00  
1  0.000000  1.570796e+00  
2  0.000000  3.141593e+00  
3  0.163590  1.490116e-08  
4  0.612607  4.992820e-01  
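`np.inner` builds the full N x N matrix of dot products although only its diagonal is used. A row-wise sketch avoids that. It also clips the dot product to [-1, 1], on the assumption that the tiny non-zero angle for row 3 above comes from rounding error. The small frame below is illustrative and only reuses the column names of the merged table.

```python
import numpy as np
import pandas as pd

# Unit vectors, one pair per row; column names mirror the merged table above.
dfM = pd.DataFrame({
    "xa": [1.0, 1.0, 1.0], "ya": [0.0, 0.0, 0.0], "za": [0.0, 0.0, 0.0],
    "xb": [1.0, 0.0, -1.0], "yb": [0.0, 1.0, 0.0], "zb": [0.0, 0.0, 0.0],
})
va = ["xa", "ya", "za"]
vb = ["xb", "yb", "zb"]

# Row-wise dot product: multiply element-wise, then sum across each row.
dots = (dfM[va].values * dfM[vb].values).sum(axis=1)

# Clip so that rounding error cannot push the argument outside arccos's domain.
dfM["angle"] = np.arccos(np.clip(dots, -1.0, 1.0))
```

The row-wise product needs N x 3 multiplications instead of N x N x 3, which matters once the tables grow large.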


In [63]:
a=np.array([[1,2],[3,4]])
b=np.array([[11,12],[13,14]])
print(a)
print(b.T)
# print(np.dot(a,b))
print(np.inner(a,b))


[[1 2]
 [3 4]]
[[11 13]
 [12 14]]
[[35 41]
 [81 95]]


###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [129]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

3    3
0    3
1    2
4    1
2    1
dtype: int64
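`value_counts` counts distinct values. For discretization, one option is to bin the values first with `pd.cut` and then count per bin. The data and bin edges below are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 3, 3, 3, 4, 6, 6, 2])

# Frequency of each distinct value, most frequent first.
counts = s.value_counts()

# Discretise into the bins (-1, 1], (1, 3] and (3, 6], then count per bin.
binned = pd.cut(s, bins=[-1, 1, 3, 6])
bin_counts = binned.value_counts().sort_index()
```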

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [130]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [131]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [132]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [133]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation to index directly by position. If you index past the end of the string, the result is NaN.

In [134]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h


Methods like replace and findall take regular expressions, too:

In [135]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^.a|dog', 'XX-XX ', case=False)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object
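Later pandas releases added a `regex` keyword to `str.replace`, and the default has changed between versions. Passing it explicitly keeps the call unambiguous. The sketch below assumes a pandas version that accepts this keyword (it is not available in 0.15.2), and also shows `findall` with a regular-expression flag.

```python
import re
import numpy as np
import pandas as pd

s3 = pd.Series(["A", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# regex=True states the pattern semantics explicitly; case=False ignores case.
replaced = s3.str.replace(r"^.a|dog", "XX-XX ", case=False, regex=True)

# findall returns, for each element, the list of all non-overlapping matches.
found = s3.str.findall("a", flags=re.IGNORECASE)
```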

###Grouping
By “group by” we are referring to a process involving one or more of the following steps:  

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)

In [136]:
df = makefoobar()
df

Unnamed: 0,A,B,C,D
0,foo,one,-1.0828,1.413746
1,bar,one,-0.428831,0.261036
2,foo,two,-2.172585,0.217979
3,bar,three,0.481878,-0.737276
4,foo,two,1.241357,-0.686429
5,bar,two,0.592939,0.518522
6,foo,one,0.816336,0.666551
7,foo,three,-0.713144,-0.243731


Group the data and then apply the sum function to the resulting groups.

In [137]:
df.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,0.645986,0.042282
foo,-1.910837,1.368115


In [138]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.428831,0.261036
bar,three,0.481878,-0.737276
bar,two,0.592939,0.518522
foo,one,-0.266464,2.080296
foo,three,-0.713144,-0.243731
foo,two,-0.931229,-0.46845


In [139]:
df = makefoobar()
print(df)
cnts = {}
#first value is the group column value, second value is the rows belonging to that group
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()  
cnts

     A      B         C         D
0  foo    one -1.498075 -1.508066
1  bar    one  0.172493  1.327274
2  foo    two  0.583990  1.128944
3  bar  three  0.261032  1.030519
4  foo    two  1.018042 -0.970777
5  bar    two  0.350983  0.795607
6  foo    one  1.404333 -1.182395
7  foo  three -0.930559 -0.555590


{'one': 0.026250279623687572,
 'three': -0.3347634110120622,
 'two': 0.65100483607779791}
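The explicit loop shows what `groupby` yields on iteration. The same per-group mean can also be obtained in a single expression. The frame below is a small hypothetical stand-in with the same column layout as `makefoobar()`.

```python
import pandas as pd

# Hypothetical data; makefoobar() itself is defined elsewhere in the notebook.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 3.0, 5.0, 7.0],
})

# Explicit loop: each iteration yields the group key and that group's rows.
cnts = {}
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()

# Same result in one expression: select column C, aggregate, convert to a dict.
cnts_direct = df.groupby("B")["C"].mean().to_dict()
```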

###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [140]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
print(tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.890401,-1.220275
bar,two,2.493911,-0.14082
baz,one,0.229494,-0.006752
baz,two,-0.007104,0.274393


The stack function “compresses” a level in the DataFrame’s columns.

In [141]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    0.890401
               B   -1.220275
       two     A    2.493911
               B   -0.140820
baz    one     A    0.229494
               B   -0.006752
       two     A   -0.007104
               B    0.274393
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [142]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.890401,-1.220275
bar,two,2.493911,-0.14082
baz,one,0.229494,-0.006752
baz,two,-0.007104,0.274393


In [143]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,0.890401,2.493911
bar,B,-1.220275,-0.14082
baz,A,0.229494,-0.007104
baz,B,-0.006752,0.274393


In [144]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.890401,0.229494
one,B,-1.220275,-0.006752
two,A,2.493911,-0.007104
two,B,-0.14082,0.274393


##Categoricals
See the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [145]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!). Then reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default).

In [146]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]
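Direct assignment to `.cat.categories` works in the pandas version used here, but to my knowledge later releases no longer allow it. A sketch of the same renaming with `rename_categories`, which returns a new Series, would be:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})
df["grade"] = df["raw_grade"].astype("category")

# rename_categories maps the existing categories (a, b, e) positionally to new names.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# set_categories reorders and adds the categories that have no members yet.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
```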

In [147]:
# Sorting is per order in the categories, not lexical order.
df.sort("grade")  # in pandas 0.17 and later: df.sort_values("grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


In [148]:
# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64

##Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [149]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-1.694450,1.181332
1,one,B,foo,-0.649523,-0.960024
2,two,C,foo,-0.147176,-1.317824
3,three,A,bar,0.087274,-0.870640
4,one,B,bar,-0.056417,1.048383
...,...,...,...,...,...
7,three,B,foo,0.764470,-0.977544
8,one,C,foo,-1.283480,0.926650
9,one,A,bar,0.860165,0.542091
10,two,B,bar,-0.620002,-1.021743


In [150]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])


Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.860165,-1.69445
one,B,-0.056417,-0.649523
one,C,1.536857,-1.28348
three,A,0.087274,
three,B,,0.76447
three,C,-0.264644,
two,A,,0.955994
two,B,-0.620002,
two,C,,-0.147176
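The empty cells above are combinations of A, B and C with no data. By default `pivot_table` aggregates with the mean. The sketch below uses made-up values to show a different aggregation via `aggfunc`, and `fill_value` to replace the empty cells.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one"],
    "C": ["foo", "bar", "foo", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Default aggregation: mean of D for each (A, C) combination.
means = pd.pivot_table(df, values="D", index=["A"], columns=["C"])

# Sum instead of mean, and put 0 where a combination has no data.
sums = pd.pivot_table(df, values="D", index=["A"], columns=["C"],
                      aggfunc="sum", fill_value=0)
```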


Reconstruct the samples presented at [Pandas Pivot Table Explained](http://pbpython.com/pandas-pivot-table-explained.html) and [Generating Excel Reports from a Pandas Pivot Table](http://pbpython.com/pandas-pivot-report.html).  

First load the data, convert the `Status` column to a pandas `category`, and set the viewing order. Then pivot with `Name` as the table index.

In [151]:
df = pd.read_excel("./data/sales-funnel.xlsx")
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
print(df.head(4))

print(pd.pivot_table(df,index=["Name"]))


   Account                          Name           Rep       Manager      Product  Quantity  Price     Status
0   714466               Trantow-Barrows  Craig Booker  Debra Henley          CPU         1  30000  presented
1   714466               Trantow-Barrows  Craig Booker  Debra Henley     Software         1  10000  presented
2   714466               Trantow-Barrows  Craig Booker  Debra Henley  Maintenance         2   5000    pending
3   737550  Fritsch, Russel and Anderson  Craig Booker  Debra Henley          CPU         1  35000   declined
                              Account  Price  Quantity
Name                                                  
Barton LLC                     740150  35000  1.000000
Fritsch, Russel and Anderson   737550  35000  1.000000
Herman LLC                     141962  65000  2.000000
Jerde-Hilpert                  412290   5000  2.000000
Kassulke, Ondricka and Metz    307599   7000  3.000000
...                               ...    ...       ...
Koepp Ltd 

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the numpy convention of raising an error when you try to convert something to a bool. This happens in an `if` statement or when using the boolean operators `and`, `or`, or `not`.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
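A minimal sketch of this gotcha and the usual ways around it: state the intended reduction explicitly with `any()`, `all()` or `empty`, and use the element-wise operators `&`, `|` and `~` in place of `and`, `or` and `not`.

```python
import pandas as pd

s = pd.Series([True, False, True])

try:
    if s:                   # ambiguous: which element should decide?
        pass
except ValueError as err:
    message = str(err)      # pandas reports that the truth value is ambiguous

# State the intended reduction explicitly instead.
any_true = s.any()          # True if at least one element is True
all_true = s.all()          # True only if every element is True
is_empty = s.empty          # True if the Series has no elements

# Element-wise logic uses &, | and ~ rather than and, or and not.
both = s & ~s
```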


#Date and time

In [57]:
# print('0')
# TMY = pd.DataFrame([0,1],index=['1981-01-01T10:00:00.000000000+0200', '1981-01-01T11:00:00.000000000+0200'])
# print(TMY)
# print('1')
# print(type(TMY.index.to_datetime().values))
# print(TMY.index.to_datetime().values)
# print('2')
# print(type(TMY.index.astype(np.int64)))
# print(TMY.index.astype(np.int64) // 10**9)  #timestamp is unix time with nanoseconds
# print('3')
# print(type(pd.to_datetime(TMY.index.astype(np.int64))))
# print(pd.to_datetime(TMY.index.astype(np.int64)))  #timestamp is unix time with nanoseconds
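The commented-out cell above experiments with converting timestamps that carry a UTC offset to Unix time. One way to sketch that conversion is shown below. The timestamp strings are simplified from those in the cell, and subtracting the epoch is my choice here rather than the notebook's `astype(np.int64)` route.

```python
import pandas as pd

# Two timestamps with a +02:00 offset, similar to those in the cell above.
stamps = ["1981-01-01T10:00:00+02:00", "1981-01-01T11:00:00+02:00"]

# utc=True parses the offsets and returns a UTC DatetimeIndex.
idx = pd.to_datetime(stamps, utc=True)

# Seconds since the Unix epoch: subtract the epoch, floor-divide by one second.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_seconds = (idx - epoch) // pd.Timedelta(seconds=1)

# Round trip back to timestamps.
restored = pd.to_datetime(unix_seconds, unit="s", utc=True)
```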


## Python and [module versions, and dates](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb)

In [152]:
%load_ext version_information
%version_information pandas, numpy, scipy, matplotlib, pyradi

Software,Version
Python,2.7.8 32bit [MSC v.1500 32 bit (Intel)]
IPython,3.0.0
OS,Windows 7 6.1.7601 SP1
pandas,0.15.2
numpy,1.9.2
scipy,0.15.1
matplotlib,1.4.3
pyradi,0.1.56
Tue Jun 16 20:05:19 2015 South Africa Standard Time
