#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/stable/basics.html>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  
<http://pbpython.com/excel-pandas-comp-2.html>  
<http://pbpython.com/improve-pandas-excel-output.html>  

<http://pbpython.com/pandas-pivot-table-explained.html>  
<http://pbpython.com/pandas-pivot-report.html>

<http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-multi-index>  

<http://www.bigdataexaminer.com/14-best-python-pandas-features/>  
<http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping>  
<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>   

<https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/>  
<http://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/>  




In [148]:
import numpy as np
import pandas as pd

import datetime

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>   

|Create pandas data structures| |
|--|--|
|s = Series(data, index) |Create a Series.|
|df = DataFrame(data, index, columns) |Create a DataFrame.|
|p = Panel(data, items, major_axis, minor_axis)|Create a Panel (removed in pandas 0.25).|


|	DataFrame Commands	|		|
|--|--|
|	df[col]	|	Select column.	|
|	df.loc[label]	|	Select row by label (use df.iloc[i] to select by integer position).	|
|	df.index	|	Return DataFrame index.	|
|	df.drop()	|	Delete given row or column. Pass axis=1 for columns.	|
|	df1 = df1.reindex_like(df2)	|	Reindex df1 with index of df2.	|
|	df.reset_index()	|	Reset index, putting old index in column named index.	|
|	df.reindex()	|	Change DataFrame index, new indices set to NaN.	|
|	df.head(n)	|	Show first n rows.	|
|	df.tail(n)	|	Show last n rows.	|
|	df.sort_index()	|	Sort index.	|
|	df.sort_index(axis=1)	|	Sort columns.	|
|	df.pivot(index,column,values)	|	Pivot DataFrame, using new conditions.	|
|	df.T	|	Transpose DataFrame.	|
|	df.stack()	|	Change lowest level of column labels into innermost row index.	|
|	df.unstack()	|	Change innermost row index into lowest level of column labels.	|
|	df.applymap()	|	Apply function to every element in DataFrame.	|
|	df.apply()	|	Apply function along a given axis	|
|	df.dropna()	|	Drops rows where any data is missing.	|
|	df.count()	|	Returns Series of row counts for every column.	|
|	df.min()	|	Return minimum of every column.	|
|	df.max()	|	Return maximum of every column.	|
|	df.describe()	|	Generate various summary statistics for every column.	|
|	concat()	|	Merge DataFrame or Series objects	|	
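A few of the commands above can be sketched on a small frame. The frame, its row labels and the variable names are made up for illustration; only the pandas calls themselves come from the table.

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

row_by_label = df.loc['y']      # label-based row selection
row_by_position = df.iloc[1]    # position-based row selection (second row)

# reindex_like conforms df to the index and columns of another frame;
# labels that df does not have ('w' here) are filled with NaN.
other = pd.DataFrame({'A': [0, 0]}, index=['y', 'w'])
aligned = df.reindex_like(other)

print(row_by_label)
print(aligned)
```

The distinction matters most when the index is not the default 0, 1, 2, ... range, because labels and positions then no longer coincide.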

|	Groupby	|		|
|--|--|
|	groupby()	|	Split DataFrame by columns. Creates a GroupBy object (gb).	|
|	gb.agg()	|	Apply function (single or list) to a GroupBy object.	|
|	gb.transform()	|	Applies function and returns object with same index as one being grouped.	|
|	gb.filter()	|	Filter GroupBy object by a given function.	|
|	gb.groups	|	Return dict whose keys are the unique groups, and values are axis labels belonging to each group.	|
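The GroupBy commands above can be sketched as follows. The column names `key` and `val` and the threshold of 6 are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: two groups, 'a' and 'b'.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'val': [1, 2, 3, 4, 5]})
gb = df.groupby('key')

# agg: one row per group, one column per aggregation function
totals = gb['val'].agg(['sum', 'mean'])

# transform: result has the same index as df, so it can be combined with it
centered = df['val'] - gb['val'].transform('mean')

# filter: keep only the rows of groups whose sum exceeds 6
big = gb.filter(lambda g: g['val'].sum() > 6)

print(totals)
print(centered)
print(big)
print(sorted(gb.groups.keys()))
```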

|	I/O	|		|
|--|--|
|	df.to_csv('foo.csv')	|	Save to CSV.	|
|	read_csv('foo.csv')	|	Read CSV into DataFrame.	|
|	to_excel('foo.xlsx', sheet_name)	|	Save to Excel.	|
|	read_excel('foo.xlsx', 'sheet1', index_col=None, na_values=['NA'])	|	Read Excel into DataFrame.	|

	


##Set the display width when printing to console

There are quite a few options to configure here. If you're using IPython, use tab completion to find the [full set](http://pandas.pydata.org/pandas-docs/version/0.15.2/options.html) of display options:

    pd.options.display.<tab>

<http://stackoverflow.com/questions/21249206/how-to-configure-display-output-in-ipython-pandas>

In [149]:
#maximum number of rows and columns displayed when a frame is pretty-printed
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 10)
# Width of the display in characters.
pd.set_option('display.width', 150)
# The maximum width in characters of a column in the repr of a pandas data structure.              
pd.set_option('display.max_colwidth', 150)
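If an option should only apply temporarily, `pd.option_context` can be used instead of `pd.set_option`. The following is a minimal sketch; the 100-row frame and the limit of 6 rows are arbitrary values picked for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': range(100)})

before = pd.get_option('display.max_rows')

# The option is changed only inside the with-block and restored afterwards.
with pd.option_context('display.max_rows', 6):
    inside = pd.get_option('display.max_rows')
    text = repr(df)   # truncated representation of the frame

after = pd.get_option('display.max_rows')

print(text)
print(before, inside, after)
```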

##Creating/Loading Data

###Functions to create different dataframe types

An empty DataFrame can be created as follows. Test to see if the DataFrame is empty. In this case it is.

In [150]:
columns = ['A','B', 'C']
df = pd.DataFrame(columns=columns)
print(df)
print(df.empty)

Empty DataFrame
Columns: [A, B, C]
Index: []
True


If you add an index, the row contents for the rows specified by the index will be empty (filled with NaN). However, testing to see if the DataFrame is empty will show that it is not empty: there are rows.

In [151]:
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=3, freq='D')
columns = ['A','B', 'C']
df = pd.DataFrame(index=index, columns=columns)
print(df)
print(df.empty)
df = df.fillna(0) # with 0s rather than NaNs
print(df)

              A    B    C
2015-04-16  NaN  NaN  NaN
2015-04-17  NaN  NaN  NaN
2015-04-18  NaN  NaN  NaN
False
            A  B  C
2015-04-16  0  0  0
2015-04-17  0  0  0
2015-04-18  0  0  0


###Creating and filling DataFrames

The following  functions create pandas dataframes in a variety of ways.

In [156]:
# DataFrame by passing a numpy array, with a datetime index and labeled columns.
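def makeDateRand(nrows=6, ncols=4):
    # use nrows/ncols for the index length and the column labels (up to 26 columns)
    dates = pd.date_range('20130101', periods=nrows)
    cols = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')[:ncols]
    df = pd.DataFrame(np.random.randn(nrows, ncols), index=dates, columns=cols)
    return df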

In [153]:
# DataFrame by passing a numpy array, with an int index and labeled columns.
def makeRand(nrows=4, ncols=4):
    # label as many columns as requested (up to 26)
    return pd.DataFrame(np.random.randn(nrows, ncols), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')[:ncols])

In [88]:
#create from dictionary
def makefoobar():
    return  pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                          'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                          'C' : np.random.randn(8),
                          'D' : np.random.randn(8)})

In [89]:
#create from dictionary
def makegridDF():
    return  pd.DataFrame({'A' : [1,2,3],
                          'B' : [4,5,6],
                          'C' : [7,8,9],
                          'D' : [10,11,12]})

In [197]:
#create from dictionary
def makegAlphaDF():
    return  pd.DataFrame({'A' : ['0a','1a','2a'],
                          'B' : ['0b','1b','2b'],
                          'C' : ['0c','1c','2c'],
                          'D' : ['0d','1d','2d']})

In [90]:
#create dataframe from a user-supplied string, using a user-defined regex separator 
def makeFromString(string, sep=r'\s+', header=0):
    from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
    # header=0 means the first non-blank line holds the column names
    return pd.read_csv(StringIO(string), sep=sep, header=header)
#alternative method
#     import io
#     return pd.read_table(io.StringIO(string), sep=sep, header=header)

In [91]:
# DataFrame by passing a dict of objects that can be converted to series-like.
# using categorical in column E
def makecatedf():
    df = pd.DataFrame({   'A' : 1.,
                       'B' : pd.Timestamp('20130102'),
                       'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                       'D' : np.array([3] * 4,dtype='int32'),
                       'E' : pd.Categorical(["test","train","test","train"]),
                       'F' : 'foo',
                       'G': ['foox','fooa','foon','fooz']})
    return (df)

In [92]:
#create a dataframe with a NaN
def makeNaNdf():
    return pd.DataFrame([[1, np.nan], [3, 4], [4,5]], columns=list('AB'))

In [93]:
#create a DataFrame with hierarchical column index
#From http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
def createMultColIdx():
    return pd.DataFrame([list('abcd'),
                  list('efgh'),
                  list('ijkl'),
                  list('mnop')],
                  columns=pd.MultiIndex.from_product([['one','two'],
                      ['first','second']]))

In [193]:
print(makeDateRand())

                   A         B         C         D
2013-01-01 -0.258388 -2.164884  0.105801  1.249241
2013-01-02  0.506519  0.345358 -0.297853  0.304403
2013-01-03  0.236215  0.712320 -0.585526 -0.592531
2013-01-04 -0.368021 -1.010381  0.372366  0.638331
2013-01-05  1.214579 -1.372529 -1.449449  0.526774
2013-01-06  1.106840 -1.064536 -1.043180 -0.668716


Display the data types

In [95]:
df2 = makecatedf()
print(df2.dtypes)
df2

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
G            object
dtype: object


   A          B  C  D      E    F     G
0  1 2013-01-02  1  3   test  foo  foox
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
3  1 2013-01-02  1  3  train  foo  fooz


In [96]:
content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

makeFromString(content)

       Time       A_x       A_y       A_z       B_x       B_y       B_z
0 -0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
1 -0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
2  0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

##File Input/Output

###CSV and Excel

In [97]:
df = makeDateRand()
df.to_csv('foo.csv')
pd.read_csv('foo.csv')


In [98]:
df = makeDateRand()
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])


###HDF5

Excel and CSV formats can only store a single element per cell. HDF5 provides the means to store hierarchical data, where some DataFrame cells can contain Numpy arrays or other structures. The example below has a Numpy array in each cell of column 'arrays' and a pandas Series of arrays in column 'dframe'.

Note that the Series index must match the dataframe index, otherwise the Series elements cannot be assigned (NaN are then assigned to all elements in the column).

<http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#io-hdf5>
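The note on index matching can be illustrated without HDF5. This is a minimal sketch with made-up column names (`ok`, `bad`): a Series assigned to a column is aligned on index labels, so a Series whose index shares no labels with the DataFrame yields only NaN.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1, 2, 3]}, index=dates)

matching = pd.Series([10, 20, 30], index=dates)
mismatched = pd.Series([10, 20, 30])   # default integer index 0..2

df['ok'] = matching        # aligned label by label
df['bad'] = mismatched     # no labels in common, so every value becomes NaN

print(df)
```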

In [99]:
df = makeDateRand()
df['arrays'] = [np.asarray([[1,x],[x/2,4*x]]) for x in range(6)]
ser = pd.Series([np.asarray([[1,x],[x/2,4*x]]) for x in range(6)],index=pd.date_range('20130101',periods=6))
print(ser)
df['dframe'] = ser
print(df)
df.to_hdf('df.hdf5','df',mode='w',append=False)
df = pd.read_hdf('df.hdf5', 'df')
print(df)
print(df.dtypes)


##Dataframe properties

###Row properties

The row count can be obtained in two different forms:

- `len(df)` and `df.shape[0]` give the number of rows in the DataFrame, irrespective of the contents of the cells.
- `df.count()` returns a pandas Series containing the number of valid (non-NaN) entries in each column.

In [100]:
df = makeNaNdf()
print(df)
print('\nlen(df) = {}'.format(len(df)))
print('\nshape[0] = {}'.format(df.shape[0]))
print('\ntype(df.count()) = {}'.format(type(df.count())))
print('\ndf.count() = \n{}'.format(df.count()))
print("\ndf.count()['A'] = {}".format(df.count()['A']))
print('\ndf.count()[1] = {}'.format(df.count()[1]))
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

   A   B
0  1 NaN
1  3   4
2  4   5

len(df) = 3

shape[0] = 3

type(df.count()) = <class 'pandas.core.series.Series'>

df.count() = 
A    3
B    2
dtype: int64

df.count()['A'] = 3

df.count()[1] = 2

Nans = 
       A      B
0  False   True
1  False  False
2  False  False


The DataFrame index (row names) can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html), which can be converted to a list of the row names:

In [101]:
df = makegridDF()
print(df.index)
print(df.index.tolist())

Int64Index([0, 1, 2], dtype='int64')
[0, 1, 2]


If the DataFrame index is a more complex data type, an index of that type is returned:

In [102]:
df = makeDateRand()
print(df.index)
print(df.index.tolist()) # returns a list
print(df.index.values) # returns an array


In [103]:
#set / change the index name
df = makegridDF()
df.index.name = 'MyIndex'
print(df)
print(df.index.tolist()) # returns a list
print(df.index.values)  # returns an array

         A  B  C   D
MyIndex             
0        1  4  7  10
1        2  5  8  11
2        3  6  9  12
[0, 1, 2]
[0 1 2]


Use a column's values to set the index. Note that repeated values are allowed in the index; selecting a repeated label with `.loc` then returns every matching row.

In [104]:
df = makegridDF()
print(df)

df.loc[2,'A'] = 2
df.index = df.A
df.index.name = 'MyNewIndex'
print(df.loc[2,:])


   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
            A  B  C   D
MyNewIndex             
2           2  5  8  11
2           2  6  9  12
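Whether an index contains repeated labels can be checked before relying on label lookups. A minimal sketch, using a small made-up frame and `set_index` rather than direct assignment to `df.index`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [4, 5, 6]})
indexed = df.set_index('A', drop=False)   # keep 'A' as a column as well

unique = indexed.index.is_unique          # False: label 2 occurs twice
repeats = indexed.index.duplicated()      # marks the second occurrence of each label
matches = indexed.loc[2]                  # a DataFrame holding both matching rows

print(unique)
print(repeats)
print(matches)
```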


### Column properties

The column names can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html):

In [105]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

Get a list of the columns in the DataFrame. There are several ways to do this:

In [106]:
df = makeDateRand()
print(list(df.columns.values))
print(df.columns.values.tolist()) #fastest
print(list(df))


###DataFrame values

Get the values of the DataFrame contents as a Numpy array:

In [107]:
df = makeDateRand()
df.values


##NaN in Pandas / Numpy

Empty cells, or cells with missing data are filled with NaNs.  The example shows how to test for NaN values (`isnull()`).

In [108]:
a = np.nan
print(a)
print(pd.isnull(a))

nan
True


Use the `fillna()` function to fill NaN cells with some other value.

In [109]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))
# print(pd.isnull(a))

#change all NaN to some other value
df.B = df.B.fillna('**')
print('\nNans replaced with ** = \n{}'.format(df))


   A   B
0  1 NaN
1  3   4

Nans = 
       A      B
0  False   True
1  False  False

Nans replaced with ** = 
   A   B
0  1  **
1  3   4


##Manipulating DataFrames

###View of a DataFrame vs a Copy of a DataFrame

See [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) for the full example.
           
More information on [multi-indexing](http://pandas-docs.github.io/pandas-docs-travis/advanced.html)

In [110]:
dfmi = createMultColIdx()
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


In the code below, the first form `dfmi['one']['second']` is called chained indexing: it makes two `__getitem__` calls in sequence. The first call `dfmi['one']` returns a DataFrame, which is the input to the second call `(dfmi['one'])['second']`.

The second form `dfmi.loc[:,('one','second')]` passes a nested tuple to a single call to `__getitem__`, which can be significantly faster, and allows one to index both axes if so desired. Compare the `Name` of the Series returned in each case to spot the difference.

In [111]:
print(dfmi['one']['second'])
print(dfmi.loc[:,('one','second')])


0    b
1    f
2    j
3    n
Name: second, dtype: object
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object


The first form gives a `SettingWithCopyWarning` when used for assignment. Since chained indexing makes two calls, either call may return a copy of the data, depending on how it is sliced. When setting a value you may therefore be setting it on a copy, and not on the original frame data.

The `.loc` operation is a single python operation, and thus can select a slice (which still may be a copy), but allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would think.

The reason for having the `SettingWithCopy` warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array.

In [112]:
dfmi['one']['second'] = 1 # assignment has no effect on the original!!
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


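To get the desired effect, use `.loc` to directly address the original DataFrame. Here `slice('one','second')` acts as a label slice over the first column level, so, as the output shows, every column under `one` is set. To set only the `('one','second')` column, pass the tuple instead: `dfmi.loc[:,('one','second')] = 1`.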

<http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers>

In [113]:
dfmi.loc[:,slice('one','second')] = 1
print(dfmi)

    one          two       
  first second first second
0     1      1     c      d
1     1      1     g      h
2     1      1     k      l
3     1      1     o      p
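To change a single column of a hierarchical column index, one `.loc` call with a tuple is enough. A minimal sketch on a two-row version of the frame above; the value `'X'` is an arbitrary placeholder.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame([list('abcd'), list('efgh')], columns=cols)

# A single .loc call with a tuple addresses exactly one column,
# so the assignment reaches the original frame.
dfmi.loc[:, ('one', 'second')] = 'X'

print(dfmi)
```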


###Dropping a row from a DataFrame

When using `drop()`, note the axis direction:

- to drop a column, use `axis=1`
- to drop a row, use `axis=0` (default)

In [114]:
#drop a column
df = makegridDF()
#first drop the 'A' column
print(df.drop('A',axis=1))

   B  C   D
0  4  7  10
1  5  8  11
2  6  9  12


You can also use row indexes to drop rows.  The following example shows several index-based row drop methods.

In [115]:
df = makegridDF()
print(df)
print(df.drop(1)) # single index
print(df.drop([1,2])) # list of indexes
print(df.drop(0,axis=0)) #single index explicit row selection
print(df.drop(df.index[[0,2]])) #list in the index 
for idx, row in df.iterrows():#iterate over all rows
    print(idx,row)
    df.drop(idx,inplace=True)
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
   A  B  C   D
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
(0, A     1
B     4
C     7
D    10
Name: 0, dtype: int64)
(1, A     2
B     5
C     8
D    11
Name: 1, dtype: int64)
(2, A     3
B     6
C     9
D    12
Name: 2, dtype: int64)
Empty DataFrame
Columns: [A, B, C, D]
Index: []


In [116]:
#drop some of the rows
df = makeDateRand()
df = df.drop(df.index[[0,1,2,3]])
for idx, row in df.iterrows():
    print(idx,row)


In [117]:
#drop row based on value in a column
df = makegridDF()
print(df)
print(df[df['A'] >= 3])
print(df[(df['A'] >= 2) & (df['B']<6)])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
2  3  6  9  12
   A  B  C   D
1  2  5  8  11


###Concatenation or Appending rows or DataFrames

`df.shape[0]` returns the number of rows in the DataFrame. With the default zero-based integer index the last row has label `df.shape[0] - 1`, so using `df.shape[0]` as a row label points to a new row immediately beyond the current last row.  This is an easy way to add row(s) to an existing DataFrame.

Rows can be added to the DataFrame by [setting with enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement).  The `df.loc[i]` (location) construct points to row `i`, which need not be an existing row.

In [118]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df)
df.loc[df.shape[0]] = ['a','b'] # add a row immediately beyond the current last
df.loc[df.shape[0]] = [np.nan,'new!'] # add a row immediately beyond the current last
print(df)

   A  B
0  1  2
1  3  4
     A     B
0    1     2
1    3     4
2    a     b
3  NaN  new!


Append rows to a DataFrame; see [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation).  This example makes a copy of one of the rows and appends it to the DataFrame.  Note that in this case a `copy()` is required to create a new Series, which is modified before appending.

In [119]:
df = makeRand()
s = df.iloc[2].copy() # copy is required, otherwise a view is taken
s[2] = 1000
print(s)
df.append(s, ignore_index=True)

A       0.642396
B      -1.006489
C    1000.000000
D      -0.098884
Name: 2, dtype: float64


          A         B            C         D
0 -1.345419 -1.068482    -0.398066 -0.137575
1  0.937761 -2.201902    -1.189793  1.099483
2  0.642396 -1.006489    -2.081371 -0.098884
3  0.290115  0.647737     0.880789  0.974610
4  0.642396 -1.006489  1000.000000 -0.098884


The `append()` function can be used to add one or more rows, formed as a DataFrame, to an existing DataFrame.

In [120]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = df.append(df2) # append row(s)
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8


Concatenate some existing rows from the current DataFrame to itself.

In [121]:
df = makeRand()
df2 = pd.concat([df,df[2:4]])
df2

Unnamed: 0,A,B,C,D
0,-1.500266,0.976661,0.013012,0.00899
1,0.960616,1.458199,-0.131962,-0.692836
2,2.065584,-0.710521,-1.226007,-1.072416
3,-0.74877,-2.214471,0.323839,-0.469724
2,2.065584,-0.710521,-1.226007,-1.072416
3,-0.74877,-2.214471,0.323839,-0.469724


The following example concatenates three views of a DataFrame.

In [122]:
# Concatenating pandas objects together
df = makeRand(10,4)
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)


          A         B         C         D
0 -0.983167  1.698776  0.788667  1.279863
1  1.386260 -1.191339 -2.430752 -0.512346
2  0.538155  0.278630 -1.896839 -1.454279
3 -0.259158 -0.241541  1.248967  0.102363
4 -0.584848  1.324369 -0.325285  1.075654
5  1.066595 -0.448447  0.153916 -0.058102
6 -0.715968 -0.801779  1.361379  0.793378
7  0.768382 -0.262929 -0.948817  0.379604
8 -0.464480 -1.075075  0.572889  0.912850
9  0.295642 -1.892635 -0.131382 -0.584188


          A         B         C         D
0 -0.983167  1.698776  0.788667  1.279863
1  1.386260 -1.191339 -2.430752 -0.512346
2  0.538155  0.278630 -1.896839 -1.454279
3 -0.259158 -0.241541  1.248967  0.102363
4 -0.584848  1.324369 -0.325285  1.075654
5  1.066595 -0.448447  0.153916 -0.058102
6 -0.715968 -0.801779  1.361379  0.793378
7  0.768382 -0.262929 -0.948817  0.379604
8 -0.464480 -1.075075  0.572889  0.912850
9  0.295642 -1.892635 -0.131382 -0.584188


Concatenation stacks together rows from two DataFrames. In the example below the `df` DataFrame is concatenated with a 2x2 slice from `dfr`.  There are two observations from the code below:

- The column names are used when concatenating the rows. In the first example the column names are consistent and the rows are appended as expected. In the second example the column names do not exactly agree and cells with missing data are filled with NaN.
- The index data type of the concatenated DataFrame must be the same as that of the main DataFrame (a hash error occurs otherwise).  For example, if the index is a DateTime series, the concatenation will not work.

In [123]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
dfr = makegridDF()
print(dfr)
print('\nExample 1')
print('to be concatenated={}'.format(dfr.loc[1:2,['A','B']]))
df2 = pd.concat([df,dfr.loc[1:2,['A','B']]])
print(df2)

print('\nExample 2')
print('to be concatenated={}'.format(dfr.loc[1:2,['B','C']]))
df2 = pd.concat([df,dfr.loc[1:2,['B','C']]])
print(df2)

#this will not concatenate in the examples above - index of wrong type
# df = makeDateRand()
# df.loc['20130102':'20130104',['A','B']]

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

Example 1
to be concatenated=   A  B
1  2  5
2  3  6
   A  B
0  1  2
1  3  4
1  2  5
2  3  6

Example 2
to be concatenated=   B  C
1  5  8
2  6  9
    A  B   C
0   1  2 NaN
1   3  4 NaN
1 NaN  5   8
2 NaN  6   9


###SQL style merges

See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join).

In [124]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
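The example above uses the default inner join. The `how` argument selects other SQL-style joins, and `indicator=True` (available in pandas 0.17 and later) records where each row came from. A minimal sketch with made-up keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Default how='inner': only keys present in both frames
inner = pd.merge(left, right, on='key')

# how='outer': all keys from both frames; the added '_merge' column
# says whether a row came from the left frame, the right frame, or both
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```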


###Handling duplicated rows

Find duplicate rows, where the values in all the selected columns must match.  By default every occurrence except the first is marked as a duplicate; with `take_last=True` every occurrence except the last is marked instead (pandas 0.17 and later use `keep='last'` for this).  The second example creates a new DataFrame containing only the duplicated rows and counts them.

In [125]:
df2 = makegridDF()
df2.loc[1,'A'] = 1
df2.loc[1,'B'] = 1
df2.loc[0,'B'] = 1
df2['isdup'] = df2.duplicated(subset=['A','B'])
print(df2)
df2['isdup'] = df2.duplicated(subset=['A','B'], take_last=True)
print(df2)

# create a new dataframe with the repeated rows
df = df2[df2.duplicated(subset=['A','B'], take_last=True)] 
print(len(df))
print(df)

   A  B  C   D  isdup
0  1  1  7  10  False
1  1  1  8  11   True
2  3  6  9  12  False
   A  B  C   D  isdup
0  1  1  7  10   True
1  1  1  8  11  False
2  3  6  9  12  False
1
   A  B  C   D isdup
0  1  1  7  10  True


The next example only checks for duplicates in column 'A' and then deletes these row(s) from the DataFrame.

In [126]:
df2.drop_duplicates(subset=['A'], take_last=True, inplace=True)
df2

   A  B  C   D  isdup
1  1  1  8  11  False
2  3  6  9  12  False
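On pandas 0.17 and later the `take_last` argument is replaced by `keep`. The sketch below assumes such a version and uses a small made-up frame; `keep=False` additionally flags every member of a duplicated group.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [1, 1, 6], 'C': [7, 8, 9]})

first = df.duplicated(subset=['A', 'B'])               # keep='first' is the default
last = df.duplicated(subset=['A', 'B'], keep='last')   # mark all but the last occurrence
every = df.duplicated(subset=['A', 'B'], keep=False)   # mark all occurrences

# Drop duplicates in column 'A', keeping the last occurrence
deduped = df.drop_duplicates(subset=['A'], keep='last')

print(first.tolist(), last.tolist(), every.tolist())
print(deduped)
```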


The index value of any arbitrary row can be changed by making a list of the index, changing the value in the list and then re-assigning the list back to the DataFrame.

In [127]:
df2 = makegridDF()
df2.index = df2.index.tolist()[:-1]   + ['New Idx Value']
print(df2)

               A  B  C   D
0              1  4  7  10
1              2  5  8  11
New Idx Value  3  6  9  12


###Transpose a DataFrame

In [128]:
df = makegridDF()
print(df.T)
print(df.index)
print(df.columns)

    0   1   2
A   1   2   3
B   4   5   6
C   7   8   9
D  10  11  12
Int64Index([0, 1, 2], dtype='int64')
Index([u'A', u'B', u'C', u'D'], dtype='object')


###Selecting a subset of columns from a DataFrame

In [129]:
df = makegridDF()
print(df)
print(df.A)
print(df['A'])
print(df[['A','B']])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Selecting a subset of columns may result in a copy or a view of the original DataFrame.  In this example a copy is made when selecting the columns, but a warning ensues when you try to assign a value to an element.  This warning arises because in some cases a view into the original DataFrame is returned and pandas cannot always know which form is used.

<http://stackoverflow.com/questions/11285613/selecting-columns>  

In [130]:
df = makegridDF()
df1 = df[['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


A safer way to build a new DataFrame with a selection of columns, which also gets rid of the warning, is as follows:

In [131]:
df = makegridDF()
df1 = df.loc[:,['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


Another way would be to use `ix` (deprecated in pandas 0.20 and removed in 1.0, where `.loc` and `.iloc` replace it).  However, in this case a view is returned, which means that changing `df1` also changes the original DataFrame.

In [132]:
df = makegridDF()
# df1 = df.ix[:,slice('A','B')] # this and the following have same effect.
df1 = df.ix[:,0:2]  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
     A  B  C   D
0    1  4  7  10
1    2  5  8  11
2  100  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


To force a copy of the original, use the `copy` method.

In [133]:
df = makegridDF()
df1 = df.ix[:,slice('A','B')].copy() # this and the following have same effect.
# df1 = df.ix[:,0:2].copy()  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


###Adding a column to a dataframe

It is relatively easy to add a column to an existing data frame. 

In [198]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01 -0.694530  1.934292 -0.347482  0.603173  0.892280
2013-01-02 -1.148821  0.575717  0.544431 -0.683324 -0.028674
2013-01-03  2.266512 -1.366276  0.413346  2.542879  1.313581
2013-01-04  0.934316  0.209193 -0.169195  0.256497  0.974314
2013-01-05 -0.011793 -0.731210  0.863213  0.958653  0.120210
2013-01-06 -0.018375  0.975624  2.796716 -0.972260  3.753965


String concatenation can be used across columns.

In [206]:
df = makegAlphaDF()
print(df)
df2 = df['A'] + df['B']
print(df2)

    A   B   C   D
0  0a  0b  0c  0d
1  1a  1b  1c  1d
2  2a  2b  2c  2d
0    0a0b
1    1a1b
2    2a2b
dtype: object
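Two further ways of adding a column are `assign`, which returns a new DataFrame, and `insert`, which modifies the DataFrame in place at a chosen position. A minimal sketch; the column names `Total` and `Label` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# assign returns a new frame and leaves df untouched
with_total = df.assign(Total=df['A'] + df['B'])

# insert modifies df in place and places the new column at position 0;
# numeric columns are converted to strings before concatenation
df.insert(0, 'Label', df['A'].astype(str) + '-' + df['B'].astype(str))

print(with_total)
print(df)
```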


###Delete rows based on column value

DataFrame has an `isin` method. When calling `isin`, pass a set of values as either an array or a dict. If the values are an array, `isin` returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

<http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin>  
<http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing>  

In the example below delete all rows where the value in column 'A' is in a given list.

In [135]:
df = makegridDF()
print(df)
idx = df['A'].isin([1,3])
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11


To match certain values in certain columns, make a dict where the key is the column and the value is a list of items you want to check for.  Combine DataFrame's `isin()` with the `any()` and `all()` methods to quickly select subsets of your data that meet given criteria, for example to select rows where each column meets its own criterion.

In the first example, remove all rows where the requirements for __all__ of the tests are met ('A' has 1 or 2, 'B' has 5 or 6, 'C' has 8 and 'D' has 11).

<http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.all.html#numpy.ndarray.all>

In [136]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6], 'C': [8], 'D': [11]})
print(idx)
idx = idx.all(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True   True   True
2  False   True  False  False
0    False
1     True
2    False
dtype: bool
   A  B  C   D
0  1  4  7  10
2  3  6  9  12


In the second example, drop all rows where the requirement for __any__ of the tests is met ('A' has 1 or 2, or 'B' has 5 or 6).  In this case, every row matches at least one test, so all rows are dropped from the DataFrame, leaving it empty.

In [137]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6]})
print(idx)
idx = idx.any(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True  False  False
2  False   True  False  False
0    True
1    True
2    True
dtype: bool
Empty DataFrame
Columns: [A, B, C, D]
Index: []


###Sorting

Sort by column index (axis=1), i.e., rearrange column order.

In [138]:
df = makeDateRand()
df.sort_index(axis=1, ascending=False)

NameError: global name 'nrow' is not defined

Sort all rows by row value in column B.

In [139]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D


Sort all rows by row value in multiple columns.

In [140]:
df.sort_values(by=['B', 'C'])

Unnamed: 0,A,B,C,D
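
The sorting calls above can also be tried on a small frame that does not depend on `makeDateRand()`. The following is a minimal sketch, assuming a pandas version that provides `sort_values` (0.17 or later); the frame `df_sort` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, chosen so the sort order is easy to verify by eye.
df_sort = pd.DataFrame({'A': [3, 1, 2], 'B': [2, 2, 1], 'C': [9, 7, 8]})

# Rearrange the columns in descending label order: C, B, A.
by_col_label = df_sort.sort_index(axis=1, ascending=False)

# Sort rows by the values in column B, then break ties with column C.
by_b_then_c = df_sort.sort_values(by=['B', 'C'])

print(by_col_label)
print(by_b_then_c)
```

Both calls return a new DataFrame and leave `df_sort` unchanged.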


You can introduce custom sorting by using categoricals.  In this example, first sort the 'G' column on default sorting (alphabetical).  Then redefine the 'G' column as a categorical with a specific sort order. Then re-sort, using the categorical sort order.

In [141]:
df = makecatedf()
print(df)
print(df.sort_values(by='G'))

gsorter = ['fooz','fooa','foox','foon']
df.G = df.G.astype("category")
df.G = df.G.cat.set_categories(gsorter)  # returns a new Series, so assign it back
print(df.sort_values(by='G'))


   A          B  C  D      E    F     G
0  1 2013-01-02  1  3   test  foo  foox
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
0  1 2013-01-02  1  3   test  foo  foox
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
3  1 2013-01-02  1  3  train  foo  fooz
1  1 2013-01-02  1  3  train  foo  fooa
0  1 2013-01-02  1  3   test  foo  foox
2  1 2013-01-02  1  3   test  foo  foon
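
A self-contained sketch of the same custom sort, assuming a single hypothetical column in place of the frame returned by `makecatedf()`. It builds the categorical directly with `pd.Categorical`, which is one of several ways to set the category order.

```python
import pandas as pd

# Hypothetical stand-in for the 'G' column used above.
df_cat = pd.DataFrame({'G': ['foox', 'fooa', 'foon', 'fooz']})

# Desired custom order, same as gsorter above.
gsorter = ['fooz', 'fooa', 'foox', 'foon']

# An ordered categorical sorts by category position, not alphabetically.
df_cat['G'] = pd.Categorical(df_cat['G'], categories=gsorter, ordered=True)

custom_sorted = df_cat.sort_values(by='G')
print(custom_sorted)
```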


###Slicing and selecting sub-arrays

While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, use the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix. 

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)

The [pandas site](http://pandas.pydata.org/pandas-docs/stable/indexing.html) offers the following description:

Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.

1.    `.ix` supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. `.ix` is the most general and will support any of the inputs in `.loc` and `.iloc`. `.ix` also supports floating point label schemes. `.ix` is exceptionally useful when dealing with mixed positional and label based hierarchical indexes. However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use `.iloc` or `.loc`.    
       `.ix` does not and cannot guarantee that the label versus integer position resolution is perfect; you may run into [problems](https://github.com/pydata/pandas/issues/6683) here.  `.ix` is older than `.loc` and `.iloc`, which were introduced specifically to prevent this ambiguity by using stricter rules on data selection. `.ix` is faster than `.loc` and `.iloc`. Note that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so code written for current pandas versions must use `.loc` or `.iloc`.

     See more at [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced) and [Advanced Hierarchical](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-advanced-hierarchical).

1.    `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:
       * A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
       * A list or array of labels ['a', 'b', 'c']
       * A slice object with labels 'a':'f', (note that contrary to usual python slices, **both** the start and the stop are included!)
       * A boolean array

     See more at [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label)

1.     `.iloc` is primarily integer position based (from `0` to `length-1` of the axis), but may also be used with a boolean array. `.iloc`  will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:
       * An integer e.g. 5
       * A list or array of integers [4, 3, 0]
       * A slice object with ints 1:7
       * A boolean array

     See more at [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)
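
The difference between the `.loc` and `.iloc` rules listed above can be sketched as follows. The frame `df_sel` and its labels are hypothetical; the behaviour shown (inclusive label slices, exclusive position slices, `KeyError` and `IndexError`) is the one described in the list.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df_sel = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Label slice with .loc: both 'a' and 'c' are included.
by_label = df_sel.loc['a':'c']

# Position slice with .iloc: position 2 is excluded, as in plain Python.
by_position = df_sel.iloc[0:2]

# A missing label raises KeyError with .loc.
try:
    df_sel.loc['z']
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# An out-of-bounds integer raises IndexError with .iloc ...
try:
    df_sel.iloc[10]
    out_of_bounds_raised = False
except IndexError:
    out_of_bounds_raised = True

# ... but an out-of-bounds slice is simply truncated.
truncated = df_sel.iloc[2:10]
```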


Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but applies to `.iloc` and `.ix` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`. (e.g. `p.loc['a']` is equivalent to `p.loc['a', :, :]`)

|Object Type |	Indexers|
|--|--|
|Series 	|`s.loc[indexer]`|
|DataFrame 	|`df.loc[row_indexer,column_indexer]`|
|Panel 	|`p.loc[item_indexer,major_indexer,minor_indexer]`|


<http://nbviewer.ipython.org/github/gboeing/python-cheat-sheets/blob/master/pandas-selecting.ipynb>  


###Conventional selection by column/index name

Selecting a single column with the form `df['A']` yields a Series, equivalent to `df.A`.  
To select multiple columns, pass a list of column names as in `df[['A','B']]`.

In [142]:
df = makegridDF()
print(df.A)
print(df['A'])
print(df[['A','B']])

0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Extract the Numpy array from the series in one of these two ways:

In [143]:
print(np.asarray(df['A']))
print(df['A'].values)

[1 2 3]
[1 2 3]


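Slice rows using `df[]` with row positions or index labels.   The row sequence can use slice notation: with integer positions the upper bound is not included, whereas with index labels (such as the date strings below) both endpoints are included.  

This is the same form as used for column selection above, which is somewhat confusing!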

In [144]:
df = makeDateRand()
print(df)
print(df[2:4])

NameError: global name 'nrow' is not defined

In [145]:
print(df['2013-01-01':'2013-01-02'])

IndexError: invalid slice
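
Since the two cells above fail in this run, here is a minimal self-contained sketch of the intended row slicing. The frame `df_dates` is a hypothetical stand-in for the output of `makeDateRand()`, filled with consecutive integers instead of random numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for makeDateRand(): six days, four columns.
dates_demo = pd.date_range('2013-01-01', periods=6)
df_dates = pd.DataFrame(np.arange(24).reshape(6, 4),
                        index=dates_demo, columns=list('ABCD'))

# Integer slice: positions 2 and 3 only, the upper bound is excluded.
by_pos = df_dates[2:4]

# Label slice: both end dates are included.
by_date = df_dates['2013-01-01':'2013-01-02']

print(by_pos)
print(by_date)
```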

###[`.ix` Conventional selection by label or position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

You can use `ix` to select slices of the data frame.  

In [None]:
df = makeDateRand()
print(df)
print('')
print(df.ix[:, 'D']) # All rows in column D
print(df.ix[0:2, 0:2]) # upper left 2x2 sub-array, not including third [2] column 
print(df.ix[0:2, [0,2,3]]) # multiple columns in list format
print(df.ix[1:3, 'A':'C']) # use range of column names, note 'C' included!!
print(df.ix[2:4, ['A','C']]) # use list of column names
print(df.ix[1:3, 'B':]) # All columns onwards from 'B'
print(df.ix[1:3, :'C']) # All columns up to and including!! C
df.ix[1:3, :'C'] = -1
print(df)
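
On pandas versions without `.ix`, the selections above can be written with `.loc` and `.iloc`. This is a sketch under the assumption that integer row bounds are meant as positions and column names as labels; `df_ix` is a hypothetical stand-in for the frame returned by `makeDateRand()`.

```python
import numpy as np
import pandas as pd

dates_demo = pd.date_range('2013-01-01', periods=6)
df_ix = pd.DataFrame(np.arange(24).reshape(6, 4),
                     index=dates_demo, columns=list('ABCD'))

col_d = df_ix.loc[:, 'D']            # all rows in column D
top_left = df_ix.iloc[0:2, 0:2]      # upper-left 2x2 block, by position
picked = df_ix.iloc[0:2, [0, 2, 3]]  # list of column positions

# Mixing positional rows with column labels: convert positions to labels first.
rows = df_ix.index[1:3]
a_to_c = df_ix.loc[rows, 'A':'C']    # label slice, 'C' is included
from_b = df_ix.loc[rows, 'B':]       # all columns from 'B' onwards

# Assignment through .loc writes into the original frame.
df_ix.loc[rows, :'C'] = -1
print(df_ix)
```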

Copying discontinuous column ranges takes a bit more effort. First create a list of the required columns.

In [None]:
df = makeDateRand()
lst = list(df.columns[0:1]) + list(df.columns[2:3])
print(lst)
df1 = df[lst].copy() # copy was made, use this to get rid of the warning 
df1.ix[2,0] = +1000
print(df1)

df2 = df.ix[:,lst] # ix appears to have made a copy
df2.ix[2,0] = +1000
print(df2)
print(df)



###[`.loc` Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

The following example looks strange at first: the row is selected with `dates[0]`, although the index of the DataFrame is not named (`df.index.name` prints `None`). Here `dates` is not an index name but the variable holding the dates that were used as the index (see the function where the DataFrame was created), so `dates[0]` is simply the first index label.

In [None]:
df = makeDateRand()
print(df)
print(df.index)
print(df.index.name)
print(df.columns)
print(df.loc[dates[0]])

In this example the index is the default integer range and is not named. The row is selected with `.loc` using its integer label 0.

In [None]:
df = makegridDF()
print(df)
print(df.index)
print(df.index.name)
print(df.loc[0])

Select all the rows, but only the 'A' and 'B' columns of these rows.

In [None]:
df.loc[:,['A','B']]

In the following example a slice is made on both rows and columns.  Note that when using `loc` both endpoints in the row range are returned, but in the `ix` case, where the integers are interpreted as positions, the upper bound must point to one beyond the end row.

In [None]:
df = makeDateRand()
print(df.loc['20130102':'20130104',['A','B']])
print(df.ix[1:4,['A','B']])

This example selects a row by using a dynamically generated datetime value.

In [None]:
df = makeDateRand()
df.loc[datetime.datetime(2013, 1, 2)]

Rows can also be selected by numeric position by using `iloc` (the older `irow` method is no longer available in current pandas versions).

In [None]:
df = makeDateRand()
print(df)
print(df.iloc[1])
print(df.iloc[3])

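This example iterates over all rows and assigns values during iteration. Only the assignment through `df.loc[idx, 'B']` is certain to write to the DataFrame; assigning to `row` or using chained indexing may modify a temporary copy instead.

In [None]:
df = makeDateRand()
print(df.head())
for i,(idx, row) in enumerate(df.iterrows()):
    row['A'] = 2                    # may only change a copy of the row
    df.loc[idx, 'B'] = i            # writes to df
    df.loc[idx]['C'] = np.sqrt(i)   # chained indexing: may not write to df
print(df.head())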
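
A small sketch of what does and does not get written back during `iterrows()`. The frame `df_iter` is hypothetical and deliberately mixes integer and float columns, in which case each `row` is a copy; with other dtypes the behaviour may differ, so writing through `row` should not be relied on.

```python
import numpy as np
import pandas as pd

# Mixed dtypes (int and float), so iterrows() has to build a new row each time.
df_iter = pd.DataFrame({'A': [0, 0, 0], 'B': [0.0, 0.0, 0.0]})

for i, (idx, row) in enumerate(df_iter.iterrows()):
    row['A'] = 2.0               # changes only the temporary row Series
    df_iter.loc[idx, 'B'] = i    # writes into the DataFrame itself

# Vectorised assignment avoids the loop altogether.
df_iter['C'] = np.sqrt(np.arange(len(df_iter)))
print(df_iter)
```

Column A stays at zero, column B receives the loop counter, and column C is filled in one step.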


###[`.iloc` Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

The `iloc` indexer selects rows and columns purely by integer position.

In [None]:
df = makeDateRand()
df.iloc[3]

In [None]:
df.iloc[3:5,0:2]

In [None]:
df.iloc[[1,2,4],[0,2]]

In [None]:
#slicing rows
df.iloc[1:3,:]

In [None]:
#slicing columns
df.iloc[:,1:3]

In [None]:
df.iloc[1,1]

###Series/DataFrame [enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement)

The `.loc`, `.ix` and `[]` operations can perform enlargement when setting a non-existent key for that axis. In the Series case this is effectively an appending operation.

In [None]:
se = pd.Series([1,2,3])
print(se) 
se[5] = 5.
print(se)

A DataFrame can be enlarged on either axis via `.loc`

In [None]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
print(dfi)
dfi.loc[:,'C'] = dfi.loc[:,'A']
print(dfi)
dfi.loc[3] = 5
print(dfi)

###Find row where index is nearest to given value

In [190]:
df = makeDateRand()
print(df)
print(df.iloc[np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))]) # row
print(np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))) # index

                   A         B         C         D
2013-01-01  0.325389 -2.409789 -1.111467 -1.272483
2013-01-02 -0.790742  0.218697 -0.357021  0.519667
2013-01-03  0.817886  0.969074 -0.971767  0.028446
2013-01-04 -1.678607 -0.197092  0.690585  0.481876
2013-01-05 -0.545952 -1.100644  0.822460 -0.521788
2013-01-06  1.811657 -0.771722 -1.178580  1.317708
A   -1.678607
B   -0.197092
C    0.690585
D    0.481876
Name: 2013-01-04 00:00:00, dtype: float64
3


In [191]:
df = makeRand()
print(df)
row = df.iloc[np.argmin(np.abs(df.index - 2))] # row
print(type(row))
print(row)
print(np.argmin(np.abs(df.index - 2))) # index

          A         B         C         D
0 -1.315140 -0.242211 -1.194011  0.623632
1  0.723594  1.256630 -1.018392  1.806375
2 -0.541730 -0.646721 -0.450718  1.455102
3  1.799430 -0.455019 -0.801430  0.030720
<class 'pandas.core.series.Series'>
A   -0.541730
B   -0.646721
C   -0.450718
D    1.455102
Name: 2, dtype: float64
2
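
As an alternative to the `np.argmin` approach above, a sorted index can be asked for the nearest position directly with `Index.get_indexer(..., method='nearest')`. This is a sketch only; the frame `df_near` and the target timestamp are hypothetical.

```python
import pandas as pd

dates_demo = pd.date_range('2013-01-01', periods=6)
df_near = pd.DataFrame({'A': range(6)}, index=dates_demo)

# Target lies between two index entries, closer to 2013-01-04.
target = pd.Timestamp('2013-01-04 10:00')

# get_indexer returns integer positions; method='nearest' picks the closest label.
pos = df_near.index.get_indexer([target], method='nearest')[0]
nearest_row = df_near.iloc[pos]

print(pos)
print(nearest_row)
```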


###Find row where column is maximum

In [189]:
df = makeRand()
print(df)
pos = df['A'].values.argmax()  # integer position of the maximum in column A
print(pos)  # index
print(df.iloc[pos]) #row

          A         B         C         D
0 -0.858436  0.037875 -0.467995 -0.184116
1  1.435010  0.031610 -0.636843 -0.002721
2 -1.026010 -0.721476  0.015575 -0.804564
3  0.263172 -0.170815  0.286820 -1.346619
1
A    1.435010
B    0.031610
C   -0.636843
D   -0.002721
Name: 1, dtype: float64
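
When the index is not the default integer range, the label and the position of the maximum differ. The following sketch, on a hypothetical frame `df_max` with string labels, shows both lookups.

```python
import pandas as pd

df_max = pd.DataFrame({'A': [0.5, 2.5, 1.0]}, index=['x', 'y', 'z'])

label = df_max['A'].idxmax()                # index label of the maximum
position = df_max['A'].to_numpy().argmax()  # integer position of the maximum

row_by_label = df_max.loc[label]            # use .loc with a label
row_by_position = df_max.iloc[position]     # use .iloc with a position
```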


###Find row where specific column has nearest value

In [192]:
df = makeRand()
print(df)
value = 0
print(df.iloc[np.argmin(np.abs(df['A'] - value))]) # row
print(np.argmin(np.abs(df['A'] - value))) # index

          A         B         C         D
0 -0.456681  0.940857  0.622991  1.105931
1 -0.472553  0.106307 -1.236626 -0.277683
2 -1.455653  0.194254 -0.818134 -0.435012
3 -1.406018 -0.202545  0.300187  1.133164
A   -0.456681
B    0.940857
C    0.622991
D    1.105931
Name: 0, dtype: float64
0


###Boolean indexing and filtering

In [147]:
#filter rows by a condition on a single column
df = makeDateRand()
df[df.A > 0]

NameError: global name 'nrow' is not defined

In [59]:
#filter rows by conditions on multiple columns
df2 = df[(df.A>0) & (df.B>0)]
df2

Unnamed: 0,A,B,C,D


In [60]:
#filter specs are boolean pandas Series, which can be manipulated
filt = (df.A>0) & (df.B>0)
print(type(filt), filt)
print('filt.any() = {}'.format(filt.any()))
print('filt.all() = {}'.format(filt.all()))

(<class 'pandas.core.series.Series'>, 2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06    False
Freq: D, dtype: bool)
filt.any() = False
filt.all() = False
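
A deterministic sketch of the same boolean filtering, using a hypothetical frame `df_bool` so that the result does not depend on random data. Masks are combined with `&`, `|` and `~`; each comparison needs its own parentheses.

```python
import pandas as pd

df_bool = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, 2, 3, -4]})

# Each comparison yields a boolean Series aligned on the index.
mask = (df_bool['A'] > 0) & (df_bool['B'] > 0)

both_positive = df_bool[mask]   # rows where both conditions hold
either_positive = df_bool[(df_bool['A'] > 0) | (df_bool['B'] > 0)]
not_both = df_bool[~mask]       # complement of the mask

print(mask.any(), mask.all())
```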


In [61]:
#filter by element
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,0.806293,,0.184366,
2013-01-02,,,,
2013-01-03,0.339057,,,0.28978
2013-01-04,,,0.783143,0.160725
2013-01-05,,0.884931,0.962539,
2013-01-06,,,2.330506,1.503962


In [62]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01  0.806293 -1.264604  0.184366 -0.435645    one
2013-01-02 -0.105947 -0.639070 -0.341976 -0.682723    one
2013-01-03  0.339057 -1.960241 -2.204251  0.289780    two
2013-01-04 -1.852649 -1.315759  0.783143  0.160725  three
2013-01-05 -0.448001  0.884931  0.962539 -0.740747   four
2013-01-06 -0.687342 -0.248638  2.330506  1.503962  three


Unnamed: 0,A,B,C,D,E
2013-01-03,0.339057,-1.960241,-2.204251,0.28978,two
2013-01-05,-0.448001,0.884931,0.962539,-0.740747,four


In [63]:
#get unique values in a column
df = makefoobar()
print(df)
df.B.unique()

     A      B         C         D
0  foo    one  1.704886  1.170707
1  bar    one -0.813787 -0.153656
2  foo    two  2.170811 -0.318112
3  bar  three -0.986525  0.078411
4  foo    two  1.045485 -0.289139
5  bar    two -0.121183  0.066849
6  foo    one  0.465065 -0.929122
7  foo  three  1.328987  0.887484


array(['one', 'two', 'three'], dtype=object)

<http://stackoverflow.com/questions/20875140/apply-function-to-sets-of-columns-in-pandas-looping-over-entire-data-frame-co>  

What I want to do is simply to calculate the length of the vector for each header (A and B) in this case, for each index, and divide by the Time column. Hence, this function needs to be np.sqrt(A_x^2 + A_y^2 + A_z^2) and the same for B of course. I.e. I am looking to calculate the velocity for each row, but three columns contribute to one velocity result.      

In [52]:
#pandas approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

#filter the column names to get a list of the ones you need
a_cols = [c for c in df.columns if c.startswith("A_")]
print(a_cols)

#get the columns according to names
print(df[a_cols])

# take the Euclidean norm of each row across the columns, then divide by Time
column_initials = ["A","B"]
for column_initial in column_initials:
    cols = [c for c in df.columns if c.startswith(column_initial+"_")]
    df["Velocity_"+column_initial] = df[cols].apply(lambda x: np.sqrt(x.dot(x)), axis=1)/df.Time
print(df)  


['A_x', 'A_y', 'A_z']
         A_x       A_y       A_z
1  -0.866036 -0.609962  0.697015
2  -0.361163 -1.337141 -0.083210
3   0.026910  1.382770 -0.982824
4   0.785285 -1.251536 -1.853590
5  -0.159702 -0.492080 -0.449791
6  -0.147551  0.324448  1.592172
7   0.929406 -0.242631  0.288650
8  -0.243551  1.071559  0.732014
9   0.175272 -0.932040  0.010235
10 -0.252762  0.981510 -0.725347
        Time       A_x       A_y       A_z       B_x       B_y       B_z  Velocity_A  Velocity_B
1  -1.347362 -0.866036 -0.609962  0.697015  1.203333  2.028204 -0.445130   -0.941121   -1.781223
2  -0.376973 -0.361163 -1.337141 -0.083210  0.468981 -0.441673 -1.301762   -3.680784   -3.852924
3  -1.090916  0.026910  1.382770 -0.982824 -1.924361  1.597575  1.986942   -1.555279   -2.928059
4  -0.669853  0.785285 -1.251536 -1.853590  0.924845 -0.109046  0.289680   -3.538689   -1.455939
5  -0.784623 -0.159702 -0.492080 -0.449791 -0.140404 -0.080874 -0.156391   -0.873714   -0.287007
6   0.329386 -0.147551  0.324448 

In [59]:
#numpy approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)

   Velocity_A  Velocity_B
0   -1.912133   -1.947397
1   -0.590832   -2.002845
2    1.503772    2.322585
3   -1.979769   -0.270848
4    0.707002    3.005761
5    2.945552    3.070854
6    1.966063    1.221003
7   -2.675811   -1.822242
8    2.675041    2.190903
9    0.883929    3.713751


In [63]:
# yet another approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

result = df\
    .loc[:, df.columns!='Time']\
    .groupby(lambda x: x[0], axis=1)\
    .apply(lambda x: np.sqrt((x**2).sum(1)))\
    .apply(lambda x: x / df['Time'])

print(result)

           A         B
1  -1.766247 -3.242271
2   1.867759  2.212419
3   2.425740  2.026768
4   0.760418  0.262202
5  -0.367800 -0.943332
6   1.861910  1.598906
7  -2.862403 -4.881593
8  -4.030394 -6.074938
9  -1.299672 -1.657686
10  9.656592  6.961029


###Setting data

In [494]:
#Sum along each column
df = makeDateRand()
df['A'].sum(), df['B'].sum(), df['C'].sum()

(3.3803161605457968, 5.1584851121598989, 2.1962012098384647)

In [500]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01  0.804121 -1.047518  0.814930  1.260297  0.571532
2013-01-02  0.076090 -0.301548 -0.165532  0.312223 -0.390991
2013-01-03  0.117382  0.427753 -0.590014 -0.128085 -0.044879
2013-01-04 -0.133791 -1.007004 -0.883338 -0.781047 -2.024132
2013-01-05  0.130961  1.715588  0.665942  0.841519  2.512491
2013-01-06 -0.069732  0.965853 -0.645569  0.891189  0.250552


In [501]:
sum_row = df[['A','B','Total']].sum()
sum_row

A        0.925031
B        0.753123
Total    0.874572
dtype: float64

We need to convert the Series to a DataFrame and transpose it so that it is easier to concat onto our existing data. The `T` attribute transposes the DataFrame, turning the single column of sums into a single row.



In [502]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Unnamed: 0,A,B,Total
0,0.925031,0.753123,0.874572


The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.


In [503]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,A,B,C,D,Total
0,0.925031,0.753123,,,0.874572


Now append the totals to the end of the dataframe, rename the index value to use the word 'Total'.

In [504]:
df=pd.concat([df, df_sum], ignore_index=True)
df.index = df.index.tolist()[:-1]   + ['Total']
df.tail()

Unnamed: 0,A,B,C,D,Total
2,0.117382,0.427753,-0.590014,-0.128085,-0.044879
3,-0.133791,-1.007004,-0.883338,-0.781047,-2.024132
4,0.130961,1.715588,0.665942,0.841519,2.512491
5,-0.069732,0.965853,-0.645569,0.891189,0.250552
Total,0.925031,0.753123,,,0.874572
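
The totals row can also be added in fewer steps with setting-by-enlargement through `.loc`. This is a different technique from the transpose-and-append steps above, shown here as a sketch on a hypothetical frame `df_tot`.

```python
import pandas as pd

df_tot = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [5.0, 6.0]})
df_tot['Total'] = df_tot['A'] + df_tot['B'] + df_tot['C']

# Sum only the columns of interest, then align to all columns (others become NaN).
sum_row = df_tot[['A', 'B', 'Total']].sum().reindex(df_tot.columns)

# Setting a new label with .loc appends the row.
df_tot.loc['Total'] = sum_row
print(df_tot)
```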


Setting a new column automatically aligns the data by the indexes

In [64]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,0.806293,-1.264604,0.184366,-0.435645,one,
2013-01-02,-0.105947,-0.63907,-0.341976,-0.682723,one,1.0
2013-01-03,0.339057,-1.960241,-2.204251,0.28978,two,2.0
2013-01-04,-1.852649,-1.315759,0.783143,0.160725,three,3.0
2013-01-05,-0.448001,0.884931,0.962539,-0.740747,four,4.0
2013-01-06,-0.687342,-0.248638,2.330506,1.503962,three,5.0


In [65]:
# Setting values by label
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D
0,foo,one,1.704886,1.170707
1,bar,one,-0.813787,-0.153656
2,foo,two,2.170811,-0.318112
3,bar,three,-0.986525,0.078411
4,foo,two,1.045485,-0.289139
5,bar,two,-0.121183,0.066849
6,foo,one,0.465065,-0.929122
7,foo,three,1.328987,0.887484
2013-01-01 00:00:00,0,,,


In [66]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D
0,foo,7,1.704886,1.170707
1,bar,one,-0.813787,-0.153656
2,foo,two,2.170811,-0.318112
3,bar,three,-0.986525,0.078411
4,foo,two,1.045485,-0.289139
5,bar,two,-0.121183,0.066849
6,foo,one,0.465065,-0.929122
7,foo,three,1.328987,0.887484
2013-01-01 00:00:00,0,,,


In [67]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D
0,foo,7,1.704886,5
1,bar,one,-0.813787,5
2,foo,two,2.170811,5
3,bar,three,-0.986525,5
4,foo,two,1.045485,5
5,bar,two,-0.121183,5
6,foo,one,0.465065,5
7,foo,three,1.328987,5
2013-01-01 00:00:00,0,,,5


In [68]:
# A where operation with setting.
df = makeDateRand()
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,-0.467656,-0.234445,-0.33796,-1.558072
2013-01-02,-0.960311,-0.434319,-1.64108,-0.333798
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069
2013-01-04,-0.732735,-0.024228,-1.537414,-2.531615
2013-01-05,-0.517563,-0.499382,-0.035378,-0.046663
2013-01-06,-1.134443,-0.309736,-1.180724,-0.518965


##Missing data

pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data).

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [69]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1.0
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1.0
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069,
2013-01-04,0.732735,-0.024228,1.537414,2.531615,


Drop any rows that have missing data:

In [70]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1


Fill missing data with a given value:

In [71]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069,5
2013-01-04,0.732735,-0.024228,1.537414,2.531615,5


Get the boolean mask where values are NaN:

In [72]:
pd.isnull(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True
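
The missing-data calls above depend on `df` and `dates` from earlier cells. A self-contained sketch on a hypothetical frame `df_na`:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [1.0, np.nan, np.nan]})

complete_rows = df_na.dropna(how='any')  # keep rows without any NaN
filled = df_na.fillna(value=5)           # replace NaN with 5
mask = df_na.isnull()                    # True where a value is missing

# NaN is skipped by default in reductions such as mean().
mean_e = df_na['E'].mean()
```

None of these calls modifies `df_na`; each returns a new object.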


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [73]:
df.mean()

A    0.291686
B    0.009494
C    0.744715
D   -0.016670
dtype: float64

In [74]:
#along the other axis
df.mean(1)

2013-01-01   -0.480553
2013-01-02    0.675478
2013-01-03   -0.469204
2013-01-04    1.194384
2013-01-05   -0.007366
2013-01-06    0.631099
Freq: D, dtype: float64

###Applying functions to the data

When using `apply()`, note the axis direction.   
- `axis=0` (default): apply the function to each column, working down the rows.
- `axis=1`: apply the function to each row, working across the columns.

In [75]:
df = makegridDF()
print(df)
print(df.apply(np.cumsum))
print(df.apply(np.cumsum, axis=0))
print(df.apply(np.cumsum, axis=1))

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A  B   C   D
0  1  5  12  22
1  2  7  15  26
2  3  9  18  30


In [46]:
df = makeDateRand()
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,-0.347713,-0.907878,0.664533,0.330378
2013-01-02,0.925371,-0.07431,1.967896,-1.221288
2013-01-03,2.648762,1.003158,2.588638,-0.929843
2013-01-04,3.165779,0.714487,3.956195,-1.961708
2013-01-05,4.467329,2.50654,4.750387,-3.961429
2013-01-06,2.971142,2.393032,5.716535,-4.494308


In [47]:
df = makeDateRand()
print(df)
df.apply(lambda x: x.max() - x.min())

                   A         B         C         D
2013-01-01  1.458686 -1.186782 -1.559096 -0.980556
2013-01-02  0.467062 -1.274599  0.215515  0.820009
2013-01-03  0.500457 -2.254734  1.068736 -0.427718
2013-01-04 -1.742086  0.263522  0.334818  0.473579
2013-01-05 -0.887633 -1.092867  0.611888  0.853470
2013-01-06  1.459031  1.411495  0.639150 -0.957782


A    3.201117
B    3.666230
C    2.627831
D    1.834027
dtype: float64

In [78]:
from datetime import datetime
df = makeDateRand()
df.index.name = 'Date'
df.reset_index(level=0,inplace=True)
print(df)
#convert to string format
df.Date = df.Date.apply(lambda d: ' '.join(d.isoformat().split('T')))
print(df)
#convert back to datetime format
df.Date = df.Date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S"))
print(df)

df.index = df.Date
print(df)

        Date         A         B         C         D
0 2013-01-01  0.990319  0.174850 -0.091437  1.751810
1 2013-01-02  2.123177  0.510229 -1.563843  0.338379
2 2013-01-03 -0.808509  1.184882  2.537792  0.488163
3 2013-01-04  1.323992  1.144447  0.642264 -1.036075
4 2013-01-05  0.834353 -1.300357  0.184578 -1.460907
5 2013-01-06  0.231945  0.842537  0.117507  2.683755
                  Date         A         B         C         D
0  2013-01-01 00:00:00  0.990319  0.174850 -0.091437  1.751810
1  2013-01-02 00:00:00  2.123177  0.510229 -1.563843  0.338379
2  2013-01-03 00:00:00 -0.808509  1.184882  2.537792  0.488163
3  2013-01-04 00:00:00  1.323992  1.144447  0.642264 -1.036075
4  2013-01-05 00:00:00  0.834353 -1.300357  0.184578 -1.460907
5  2013-01-06 00:00:00  0.231945  0.842537  0.117507  2.683755
        Date         A         B         C         D
0 2013-01-01  0.990319  0.174850 -0.091437  1.751810
1 2013-01-02  2.123177  0.510229 -1.563843  0.338379
2 2013-01-03 -0.808509  1.184
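
The same round trip between datetime and string can be done without per-element lambdas, using the `.dt` accessor and `pd.to_datetime`. A minimal sketch on a hypothetical frame `df_dt`:

```python
import pandas as pd

df_dt = pd.DataFrame({'Date': pd.date_range('2013-01-01', periods=3),
                      'A': [1, 2, 3]})

# Datetime -> string, without a Python-level lambda.
df_dt['Date'] = df_dt['Date'].dt.strftime('%Y-%m-%d %H:%M:%S')
as_text = df_dt['Date'].tolist()

# String -> datetime, with an explicit format.
df_dt['Date'] = pd.to_datetime(df_dt['Date'], format='%Y-%m-%d %H:%M:%S')

# Use the Date column as the index.
df_dt = df_dt.set_index('Date')
print(df_dt)
```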

###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [79]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

6    3
0    3
4    1
3    1
2    1
1    1
dtype: int64

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [80]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [81]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [82]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [83]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.

In [84]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h
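
In pandas versions that support it (0.16.1 or later), `str.split` can return the DataFrame directly with `expand=True`. A short sketch, with `s_split` as a hypothetical copy of `s2`:

```python
import numpy as np
import pandas as pd

s_split = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# expand=True returns one column per split element instead of a Series of lists.
parts = s_split.str.split('_', expand=True)
print(parts)
```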


Methods like replace and findall take regular expressions, too:

In [85]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

###Grouping
By “group by” we are referring to a process involving one or more of the following steps:  

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)
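
The three steps can be sketched on a small hypothetical frame `df_grp`: iterating over the groups shows the split, and an aggregation performs the apply and combine steps in one call.

```python
import pandas as pd

df_grp = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'C': [1.0, 2.0, 3.0, 4.0]})

# Split: iterate over the groups to see what each one contains.
group_sizes = {key: len(group) for key, group in df_grp.groupby('A')}

# Apply + combine: aggregate column C per group into a new Series.
sums = df_grp.groupby('A')['C'].sum()
print(sums)
```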

In [90]:
df = makefoobar()
df

Unnamed: 0,A,B,C,D
0,foo,one,0.44883,1.250155
1,bar,one,0.60904,-0.216908
2,foo,two,-2.42389,-0.558726
3,bar,three,0.820044,1.364242
4,foo,two,-0.384493,0.178097
5,bar,two,0.119828,0.512074
6,foo,one,1.395724,0.145416
7,foo,three,-0.496975,0.298407


Group the rows and then apply the function `sum` to the numeric columns of the resulting groups.

In [91]:
df.groupby('A')[['C','D']].sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,1.548911,1.659408
foo,-1.460804,1.313348


In [92]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.60904,-0.216908
bar,three,0.820044,1.364242
bar,two,0.119828,0.512074
foo,one,1.844554,1.395571
foo,three,-0.496975,0.298407
foo,two,-2.808382,-0.380629


In [93]:
df = makefoobar()
print(df)
cnts = {}
#first value is the group key (value in column B), second value is a DataFrame with the members of the group
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()  
cnts

     A      B         C         D
0  foo    one  0.345037  1.120149
1  bar    one -0.983207  1.304603
2  foo    two -0.469320  0.125918
3  bar  three  0.312737  0.146465
4  foo    two  0.590574 -0.566539
5  bar    two  0.666293  0.185597
6  foo    one -0.442139  1.421216
7  foo  three  0.899687  0.275622


{'one': -0.36010299230130288,
 'three': 0.60621232215056042,
 'two': 0.26251537978182976}

###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [139]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
print(tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1.808066,-0.330584
bar,two,0.091519,-0.905225
baz,one,-0.210374,-0.477266
baz,two,0.0692,1.737672


The stack function “compresses” a level in the DataFrame’s columns.

In [95]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    1.016842
               B    1.446916
       two     A   -1.342349
               B   -0.700087
baz    one     A    0.811508
               B   -0.058965
       two     A    0.337480
               B    1.236764
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [96]:
stacked.unstack()

                     A         B
first second                    
bar   one     1.016842  1.446916
      two    -1.342349 -0.700087
baz   one     0.811508 -0.058965
      two     0.337480  1.236764


In [97]:
stacked.unstack(1)

second        one       two
first                      
bar   A  1.016842 -1.342349
      B  1.446916 -0.700087
baz   A  0.811508  0.337480
      B -0.058965  1.236764


In [98]:
stacked.unstack(0)

first          bar       baz
second                      
one    A  1.016842  0.811508
       B  1.446916 -0.058965
two    A -1.342349  0.337480
       B -0.700087  1.236764
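Levels can be given to `unstack()` by name as well as by position, which tends to read more clearly when the index levels are named. A minimal sketch on deterministic, made-up data, which also checks that `unstack()` with no argument undoes `stack()`:

```python
import numpy as np
import pandas as pd

# Illustrative frame: two-level row index, values 0.0 .. 7.0.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"])
stacked = df2.stack()

# Unstack the "second" level, once by name and once by position.
by_name = stacked.unstack("second")
by_pos = stacked.unstack(1)

# With no argument the last level is unstacked, which reverses stack().
roundtrip = stacked.unstack()

print(by_name)
print(by_name.equals(by_pos))
print(roundtrip.equals(df2))
```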


##Categoricals
See the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [35]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names (assigning to `Series.cat.categories` is in place!). Then reorder the categories and simultaneously add the missing categories (methods under `Series.cat` return a new Series by default).

In [36]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]
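The cell above reflects pandas 0.15.2. In later pandas releases, direct assignment to `.cat.categories` is no longer available, so the following sketch performs the same two steps with `rename_categories` and `set_categories`, both of which return a new Series. Treat it as an equivalent for newer versions, not as the code that produced the output above.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})
df["grade"] = df["raw_grade"].astype("category")

# Rename the existing categories (a, b, e) positionally; returns a new Series.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the ones that do not occur in the data.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print(df["grade"])
```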

In [37]:
# Sorting is per order in the categories, not lexical order.
df.sort("grade")

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
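`DataFrame.sort` was removed in later pandas releases in favour of `sort_values`. A minimal sketch of the same categorical sort with the newer method, on a frame rebuilt here for illustration; `kind="mergesort"` is chosen only to keep the order of equal grades deterministic:

```python
import pandas as pd

grade = pd.Categorical(
    ["very good", "good", "good", "very good", "very good", "very bad"],
    categories=["very bad", "bad", "medium", "good", "very good"],
    ordered=True,
)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "grade": grade})

# Rows are ordered by category position, not alphabetically.
sorted_df = df.sort_values("grade", kind="mergesort")
print(sorted_df)
```

An alphabetical sort would place "good" before "very bad"; the categorical sort places "very bad" first because it is the first category.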


In [38]:
# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64
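Newer pandas versions expose an `observed` argument on `groupby` that controls whether unused categories appear in the result. The sketch below assumes such a version; there, empty categories are reported with a count of 0 instead of the `NaN` shown in the 0.15.2 output above.

```python
import pandas as pd

grade = pd.Categorical(
    ["very good", "good", "good", "very good", "very good", "very bad"],
    categories=["very bad", "bad", "medium", "good", "very good"],
)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "grade": grade})

# observed=False keeps every category, including those with no rows.
all_sizes = df.groupby("grade", observed=False).size()

# observed=True keeps only the categories that actually occur.
seen_sizes = df.groupby("grade", observed=True).size()

print(all_sizes)
print(seen_sizes)
```

Passing `observed` explicitly avoids depending on the default, which differs between pandas versions.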

##Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [99]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

       A  B    C         D         E
0    one  A  foo  0.220966 -0.407925
1    one  B  foo  0.717566  0.335522
2    two  C  foo  0.273564 -0.904677
3  three  A  bar -1.523647 -1.088773
4    one  B  bar -1.980749 -0.375930
5    one  C  bar -0.417124 -0.289410
6    two  A  foo  0.006494 -0.415106
7  three  B  foo -0.785600 -0.970769
8    one  C  foo -0.493092  0.351793
9    one  A  bar -0.058732 -0.292374


In [100]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])


C             bar       foo
A     B                    
one   A -0.058732  0.220966
      B -1.980749  0.717566
      C -0.417124 -0.493092
three A -1.523647       NaN
      B       NaN -0.785600
      C  1.494698       NaN
two   A       NaN  0.006494
      B  0.112476       NaN
      C       NaN  0.273564
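`pivot_table` aggregates with the mean by default and leaves `NaN` where a combination has no rows. The sketch below uses a small made-up frame (not the random data above) to show the default behaviour next to `aggfunc="sum"` with `fill_value=0`:

```python
import pandas as pd

# Illustrative data; ("two", "x", "foo") occurs twice so aggregation is visible.
df = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one", "two"],
    "B": ["x", "y", "x", "y", "x", "x"],
    "C": ["foo", "bar", "foo", "bar", "bar", "foo"],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
})

# Default: mean of D per (A, B) row and C column; missing combinations are NaN.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

# Sum instead of mean, and fill missing combinations with 0.
totals = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"],
                        aggfunc="sum", fill_value=0)

print(table)
print(totals)
```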


Reconstruct the samples presented at [Pandas Pivot Table Explained](http://pbpython.com/pandas-pivot-table-explained.html) and [Generating Excel Reports from a Pandas Pivot Table](http://pbpython.com/pandas-pivot-report.html).

First load the data, convert the `Status` column to a pandas `category`, and set the viewing order of its categories. Then build a pivot table with `Name` as its index.

In [34]:
df = pd.read_excel("./data/sales-funnel.xlsx")
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
print(df.head(4))

print(pd.pivot_table(df,index=["Name"]))


   Account                          Name           Rep       Manager      Product  Quantity  Price     Status
0   714466               Trantow-Barrows  Craig Booker  Debra Henley          CPU         1  30000  presented
1   714466               Trantow-Barrows  Craig Booker  Debra Henley     Software         1  10000  presented
2   714466               Trantow-Barrows  Craig Booker  Debra Henley  Maintenance         2   5000    pending
3   737550  Fritsch, Russel and Anderson  Craig Booker  Debra Henley          CPU         1  35000   declined
                              Account   Price  Quantity
Name                                                   
Barton LLC                     740150   35000  1.000000
Fritsch, Russel and Anderson   737550   35000  1.000000
Herman LLC                     141962   65000  2.000000
Jerde-Hilpert                  412290    5000  2.000000
Kassulke, Ondricka and Metz    307599    7000  3.000000
Keeling LLC                    688981  100000  5.000000
Ki

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the NumPy convention of raising an error when you try to convert something to a `bool`. This happens in an `if` statement or when using the boolean operations `and`, `or`, or `not`.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
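A short sketch of this gotcha and the usual ways around it: reduce the Series explicitly with `.any()`, `.all()`, or `.empty`, and use the element-wise operator `&` where Python's `and` would otherwise be used. The Series values are made up for illustration.

```python
import pandas as pd

s = pd.Series([True, False, True])

# Using a Series directly in a boolean context raises ValueError.
message = None
try:
    if s:
        pass
except ValueError as err:
    message = str(err)
print(message)

# State the intent explicitly instead.
any_true = s.any()     # True if at least one element is True
all_true = s.all()     # True only if every element is True
is_empty = s.empty     # True if the Series has no elements

# Element-wise logic uses & (and), | (or), ~ (not), not the Python keywords.
both = s & pd.Series([True, True, False])
print(both.tolist())
```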


## Python and [module versions, and dates](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb)

In [106]:
%load_ext version_information
%version_information pandas, numpy, scipy, matplotlib, pyradi

|Software|Version|
|--|--|
|Python|2.7.8 32bit [MSC v.1500 32 bit (Intel)]|
|IPython|3.0.0|
|OS|Windows 7 6.1.7601 SP1|
|pandas|0.15.2|
|numpy|1.9.2|
|scipy|0.15.1|
|matplotlib|1.4.3|
|pyradi|0.1.55|

Thu Apr 09 20:22:53 2015 South Africa Standard Time
