#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/stable/basics.html>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  
<http://pbpython.com/excel-pandas-comp-2.html>  
<http://pbpython.com/improve-pandas-excel-output.html>  

<http://pbpython.com/pandas-pivot-table-explained.html>  
<http://pbpython.com/pandas-pivot-report.html>

<http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-multi-index>  

<http://www.bigdataexaminer.com/14-best-python-pandas-features/>  
<http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping>  
<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>   

<https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/>  
<http://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/>  

<http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/>  
<http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/>  


In [1]:
import numpy as np
import pandas as pd

import datetime

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>   

|Create pandas data structures| |
|--|--|
|s = Series(data, index) |Create a Series.|
|df = DataFrame(data, index, columns) |Create a DataFrame.|
|p = Panel(data, items, major_axis, minor_axis)|Create a Panel.|


|	DataFrame Commands	|		|
|--|--|
|	df[col]	|	Select column.	|
|	df.loc[label]	|	Select row by label (use df.iloc[i] to select by integer position).	|
|	df.index	|	Return DataFrame index.	|
|	df.drop()	|	Delete given row or column. Pass axis=1 for columns.	|
|	df1 = df1.reindex_like(df2)	|	Reindex df1 with index of df2.	|
|	df.reset_index()	|	Reset index, putting old index in column named index.	|
|	df.reindex()	|	Change DataFrame index; rows for new index labels are filled with NaN.	|
|	df.head(n)	|	Show first n rows.	|
|	df.tail(n)	|	Show last n rows.	|
|	df.sort_index()	|	Sort by index.	|
|	df.sort_index(axis=1)	|	Sort by column labels.	|
|	df.pivot(index,column,values)	|	Pivot DataFrame, using new conditions.	|
|	df.T	|	Transpose DataFrame.	|
|	df.stack()	|	Change lowest level of column labels into innermost row index.	|
|	df.unstack()	|	Change innermost row index into lowest level of column labels.	|
|	df.applymap()	|	Apply function to every element in DataFrame.	|
|	df.apply()	|	Apply function along a given axis	|
|	df.dropna()	|	Drops rows where any data is missing.	|
|	df.count()	|	Returns Series of row counts for every column.	|
|	df.min()	|	Return minimum of every column.	|
|	df.max()	|	Return maximum of every column.	|
|	df.describe()	|	Generate various summary statistics for every column.	|
|	concat()	|	Merge DataFrame or Series objects	|	

|	Groupby	|		|
|--|--|
|	groupby()	|	Split DataFrame by columns. Creates a GroupBy object (gb).	|
|	gb.agg()	|	Apply function (single or list) to a GroupBy object.	|
|	gb.transform()	|	Applies function and returns object with same index as one being grouped.	|
|	gb.filter()	|	Filter GroupBy object by a given function.	|
|	gb.groups	|	Return dict whose keys are the unique groups, and values are axis labels belonging to each group.	|
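
The GroupBy entries in the table can be illustrated with a minimal sketch. The frame, the column names `A` and `C`, and the threshold of 5 are made up for illustration only.

```python
import pandas as pd

# Small illustrative frame: two groups ('foo' and 'bar') in column A
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
gb = df.groupby('A')

# agg: one row per group, one column per aggregation function
summary = gb['C'].agg(['sum', 'mean'])

# transform: result has the same index as df (each value minus its group mean)
centered = gb['C'].transform(lambda x: x - x.mean())

# filter: keep only the rows of groups that satisfy the predicate
big = gb.filter(lambda g: g['C'].sum() > 5)

print(summary)
print(centered)
print(big)
```

Note that `agg` shrinks the data to one row per group, while `transform` and `filter` return rows that keep the original index labels.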

|	I/O	|		|
|--|--|
|	df.to_csv('foo.csv')	|	Save to CSV.	|
|	read_csv('foo.csv')	|	Read CSV into DataFrame.	|
|	df.to_excel('foo.xlsx', sheet_name)	|	Save to Excel.	|
|	read_excel('foo.xlsx','sheet1', index_col = None, na_values = ['NA'])	|	Read Excel into DataFrame.	|

	


##Set the display width when printing to console

There are quite a few display options to configure. If you're using IPython, use tab completion to find the [full set](http://pandas.pydata.org/pandas-docs/version/0.15.2/options.html) of display options:

    pd.options.display.<tab>

<http://stackoverflow.com/questions/21249206/how-to-configure-display-output-in-ipython-pandas>

In [2]:
#maximum number of rows and columns displayed when a frame is pretty-printed
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 10)
# Width of the display in characters.
pd.set_option('display.width', 150)
# The maximum width in characters of a column in the repr of a pandas data structure.              
pd.set_option('display.max_colwidth', 150)
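
If an option should only change for a few statements, `pd.option_context` can be used instead of `pd.set_option`. The sketch below is a minimal illustration; the value 5 is arbitrary.

```python
import pandas as pd

before = pd.get_option('display.max_rows')

# Inside the with-block the option is overridden; it is restored on exit.
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
print(before, inside, after)
```

This avoids leaking display settings into the rest of a session or notebook.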

##Creating/Loading Data

###Functions to create different dataframe types

An empty DataFrame can be created as follows. Test to see if the DataFrame is empty. In this case it is.

In [3]:
columns = ['A','B', 'C']
df = pd.DataFrame(columns=columns)
print(df)
print(df.empty)

Empty DataFrame
Columns: [A, B, C]
Index: []
True


If you add an index, the row contents for the rows specified by the index will be empty (filled with NaN).  However, testing to see if the DataFrame is empty will show that it is not empty: there are rows.

In [4]:
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=3, freq='D')
columns = ['A','B', 'C']
df = pd.DataFrame(index=index, columns=columns)
print(df)
print(df.empty)
df = df.fillna(0) # with 0s rather than NaNs
print(df)

              A    B    C
2015-11-06  NaN  NaN  NaN
2015-11-07  NaN  NaN  NaN
2015-11-08  NaN  NaN  NaN
False
            A  B  C
2015-11-06  0  0  0
2015-11-07  0  0  0
2015-11-08  0  0  0


###Creating and filling DataFrames

The following functions create pandas DataFrames in a variety of ways.

In [5]:
# DataFrame by passing a numpy array, with a datetime index and labeled columns.
def makeDateRand(nrows=6, ncols=4):
    dates = pd.date_range('20130101',periods=nrows) # one date per row
    # column labels assume ncols <= 4
    df = pd.DataFrame(np.random.randn(nrows,ncols),index=dates,columns=list('ABCD')[:ncols]) 
    return df
# print(makeDateRand())

In [6]:
# DataFrame by passing a numpy array, with a datetime index and labeled columns, but also with a Date column.
def makeDateColRand(nrows=6, ncols=4):
    dates = pd.date_range('20130101',periods=nrows) # one date per row
    # column labels assume ncols <= 4
    df = pd.DataFrame(np.random.randn(nrows,ncols),index=dates,columns=list('ABCD')[:ncols]) 
    df['Date'] = df.index
    return df
print(makeDateColRand())

                   A         B         C         D       Date
2013-01-01 -0.717155 -0.481980 -0.954465  0.884755 2013-01-01
2013-01-02 -0.504780 -0.817341  0.427308 -1.813075 2013-01-02
2013-01-03 -0.337796  0.191699 -0.540361  1.072065 2013-01-03
2013-01-04  0.571985 -0.403270 -1.023337  1.127951 2013-01-04
2013-01-05 -0.686553 -0.440686 -0.285821 -0.507483 2013-01-05
2013-01-06  0.092287 -0.440438  0.167653 -0.917911 2013-01-06


In [7]:
# DataFrame by passing a numpy array, with an int index and labeled columns.
def makeRand(nrows=4, ncols=4):
    # column labels assume ncols <= 4
    return pd.DataFrame(np.random.randn(nrows, ncols), columns=list('ABCD')[:ncols])

In [8]:
#create from dictionary
def makefoobar():
    return  pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                          'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                          'C' : np.random.randn(8),
                          'D' : np.random.randn(8)})

In [9]:
#create from dictionary
def makegridDF():
    return  pd.DataFrame({'A' : [1,2,3],
                          'B' : [4,5,6],
                          'C' : [7,8,9],
                          'D' : [10,11,12]})

In [10]:
#create from dictionary with scalars
#an index must be passed, otherwise pandas raises:
#ValueError: If use all scalar values, must pass index
def makegridScalarDF():
    return  pd.DataFrame({'A' : 1,'B' : 4,'C' : 9,'D' : 12}, index=[0])

In [11]:
#create from dictionary
def makegAlphaDF():
    return  pd.DataFrame({'A' : ['0a','1a','2a'],
                          'B' : ['0b','1b','2b'],
                          'C' : ['0c','1c','2c'],
                          'D' : ['0d','1d','2d']})

In [12]:
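#create dataframe from a user-supplied string, using a user-defined regex separator 
def makeFromString(string, sep=r'\s+', header=0):
    # StringIO lives in different modules in Python 2 and Python 3
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO
    # header=0: the first non-blank line holds the column names
    return pd.read_csv(StringIO(string), sep=sep, header=header)
#alternative method
#     import io
#     return pd.read_table(io.BytesIO(string.encode()), sep=sep, header=header)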

In [13]:
# DataFrame by passing a dict of objects that can be converted to series-like.
# using categorical in column E
def makecatedf():
    df = pd.DataFrame({'A' : 1.,
                       'B' : pd.Timestamp('20130102'),
                       'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                       'D' : np.array([3] * 4,dtype='int32'),
                       'E' : pd.Categorical(["test","train","test","train"]),
                       'F' : 'foo',
                       'G': ['foox','fooa','foon','fooz']})
    return (df)

In [14]:
#create a dataframe with a NaN
def makeNaNdf():
    return pd.DataFrame([[1, np.nan], [3, 4], [4,5]], columns=list('AB'))

In [15]:
#create a DataFrame with hierarchical column index
#From http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
def createMultColIdx():
    return pd.DataFrame([list('abcd'),
                  list('efgh'),
                  list('ijkl'),
                  list('mnop')],
                  columns=pd.MultiIndex.from_product([['one','two'],
                      ['first','second']]))

In [16]:
print(makeDateRand())

                   A         B         C         D
2013-01-01  0.823442  0.674689 -1.144674 -0.001726
2013-01-02  1.541514 -0.075816  1.107682  0.738387
2013-01-03  0.267011  2.764399 -0.675117 -0.503135
2013-01-04 -1.901062 -1.036676 -1.770690  0.572022
2013-01-05  1.393261  0.026678 -0.874046 -0.109330
2013-01-06 -0.165275 -0.104146  0.814309  1.910196


Display the data types

In [17]:
df2 = makecatedf()
print(df2.dtypes)
df2

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
G            object
dtype: object


| |A|B|C|D|E|F|G|
|--|--|--|--|--|--|--|--|
|0|1|2013-01-02|1|3|test|foo|foox|
|1|1|2013-01-02|1|3|train|foo|fooa|
|2|1|2013-01-02|1|3|test|foo|foon|
|3|1|2013-01-02|1|3|train|foo|fooz|


In [18]:
content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

makeFromString(content)

| |Time|A_x|A_y|A_z|B_x|B_y|B_z|
|--|--|--|--|--|--|--|--|
|0|-0.075509|-0.123527|-0.547239|-0.453707|-0.969796|0.248761|1.369613|
|1|-0.13358|-0.308314|-0.839347|-0.517989|0.65212|0.477232|-0.391767|
|2|0.623841|0.473552|0.059428|0.726088|-0.593291|-3.186297|-0.846863|


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

##Display two tables side-by-side

<https://gist.github.com/stefanv/6416926>  

In [19]:
class side_by_side():
    def __init__(self, *frames):
        self.frames = frames

    def _repr_html_(self):
        width = 100. / len(self.frames)

        s = ""
        for f in self.frames:
            s += "<div style='float: left;'>%s</div>" % f._repr_html_()

        return s

In [20]:
side_by_side(makeDateRand(), makeDateRand())

| |A|B|C|D|
|--|--|--|--|--|
|2013-01-01|2.130532|-0.794176|-0.953936|-0.655333|
|2013-01-02|0.060027|-0.729829|-0.270028|-1.434621|
|2013-01-03|0.593224|0.738525|-0.594715|2.309201|
|2013-01-04|-0.309246|-0.922997|0.163288|-0.378091|
|2013-01-05|1.392123|0.639442|1.006117|0.081112|
|2013-01-06|-0.16998|0.707965|-0.628059|-1.052455|

| |A|B|C|D|
|--|--|--|--|--|
|2013-01-01|-1.254796|0.818573|0.563667|0.130845|
|2013-01-02|0.492384|-0.275045|-1.200332|0.450402|
|2013-01-03|-0.392504|0.256528|0.629999|-0.111044|
|2013-01-04|-0.402669|-1.997039|-1.324378|0.570704|
|2013-01-05|-0.584896|0.937321|0.853173|-0.623927|
|2013-01-06|-0.488186|0.411843|-0.63434|-1.147726|


##File Input/Output

###CSV

In [21]:
df = makeDateRand()
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

| |Unnamed: 0|A|B|C|D|
|--|--|--|--|--|--|
|0|2013-01-01|-0.676843|1.981695|-2.18527|0.757962|
|1|2013-01-02|0.612087|0.905173|-0.37204|-0.992448|
|2|2013-01-03|0.156287|-0.100006|0.849762|-0.075893|
|3|2013-01-04|-0.172648|2.291599|-0.164205|0.835155|
|4|2013-01-05|-0.117642|0.156827|0.674909|-0.07846|
|5|2013-01-06|-0.642434|-2.203255|0.107616|-0.614951|


In [22]:
import pandas as pd
# use whitespace as separator
df = makeDateRand()
df.to_csv('foo.csv',sep=' ')
pd.read_table('foo.csv', sep=r'\s+') # raw string, so \s reaches the regex engine unchanged

| |A|B|C|D|
|--|--|--|--|--|
|2013-01-01|-0.232628|-0.251621|-0.539267|0.855817|
|2013-01-02|-0.35485|1.076|0.915474|-0.119722|
|2013-01-03|1.26986|-3.249377|0.088521|0.332785|
|2013-01-04|0.673134|1.136173|0.550805|-1.343382|
|2013-01-05|0.987851|-0.003618|0.987889|0.49522|
|2013-01-06|0.146918|-1.81756|0.648169|0.289841|


###Excel

In [23]:
df = makeDateRand()
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

| |A|B|C|D|
|--|--|--|--|--|
|2013-01-01|-0.135833|0.285342|0.501275|0.454495|
|2013-01-02|1.267054|0.51404|-0.721062|-1.300245|
|2013-01-03|1.524598|-0.697714|0.843707|-0.213828|
|2013-01-04|1.555808|0.028524|0.749512|1.155547|
|2013-01-05|2.223462|-1.831315|-2.682729|0.373763|
|2013-01-06|-0.193484|0.704739|-0.722458|0.630605|


The `read_excel` function can read one or more worksheets from an Excel file.  If only a single sheet is read, its data is returned as a DataFrame.  If more than one (or all) sheets are read, the DataFrames are returned as a dictionary whose keys are the sheet names.

In [24]:
import pandas as pd
filename = 'data/atmos-elevation-angles.xlsx'
dictEff = pd.read_excel(filename,sheetname=None)
print(dictEff['SpecRanges'])
dfSpec = dictEff['SpecRanges']
specBand = 'MWIR'
print(dfSpec[specBand])
print('Spectral band {} is defined as {}-{} um'.format(specBand,dfSpec[specBand][0],dfSpec[specBand][1]))
print(dictEff['Sheet1'].head())


   LWIR  MWIR  NIR  SWIR  Visible
0     8   3.6  0.7   1.0     0.43
1    12   4.9  0.9   1.7     0.69
0    3.6
1    4.9
Name: MWIR, dtype: float64
Spectral band MWIR is defined as 3.6-4.9 um
                    Atmo  Altitude  Zenith SpecBand   ToaWattTot     ToaWatt  BoaWattTot     BoaWatt  LpathWatt                 ToaQTot  \
0  ExtremeHotLowHumidity         0       0      NIR  1398.766052  235.180000  947.004883  196.459289   5.364350  6690789636830180409344   
0  ExtremeHotLowHumidity         0       0  Visible  1398.766052  483.245923  947.004883  370.424402  20.394616  6690789636830180409344   
0  ExtremeHotLowHumidity         0       0     MWIR  1398.766052    9.748536  947.004883    5.206474   1.379844  6690789636830180409344   
0  ExtremeHotLowHumidity         0       0     LWIR  1398.766052    1.154950  947.004883    0.616622  17.365368  6690789636830180409344   
0  ExtremeHotLowHumidity         0       0     SWIR  1398.766052  292.085568  947.004883  191.096176   2.109043  6

###HDF5

Excel and CSV formats can only store a single element per cell.  HDF5 provides the means to store hierarchical data, where some DataFrame cells can contain Numpy arrays or other structures.  The example below has a Numpy array in each cell of column 'arrays', and fills column 'dframe' from a pandas Series that holds the same arrays.

Note that the Series index must match the dataframe index, otherwise the Series elements cannot be assigned (NaN are then assigned to all elements in the column).

<http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#io-hdf5>
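
The index-matching note above applies to any Series assigned to a column, not only to HDF5 work. The following is a minimal sketch with a made-up three-row frame; the labels `x`, `y`, `z` and the column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

# Index matches: values are assigned row by row
df['good'] = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Index does not match: nothing aligns, so the whole column is NaN
df['bad'] = pd.Series([10, 20, 30], index=[0, 1, 2])

# A plain array is assigned by position and bypasses index alignment
df['by_position'] = pd.Series([10, 20, 30], index=[0, 1, 2]).values
print(df)
```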

In [25]:
df = makeDateRand()
# x//2 is integer division, so the arrays hold integers under both Python 2 and 3
df['arrays'] = [np.asarray([[1,x],[x//2,4*x]]) for x in range(6)]
ser = pd.Series([np.asarray([[1,x],[x//2,4*x]]) for x in range(6)],index=pd.date_range('20130101',periods=6))
print(ser)
df['dframe'] = ser
print(df)
df.to_hdf('df.hdf5','df',mode='w',append=False)
df = pd.read_hdf('df.hdf5', 'df')
print(df)
print(df.dtypes)

2013-01-01     [[1, 0], [0, 0]]
2013-01-02     [[1, 1], [0, 4]]
2013-01-03     [[1, 2], [1, 8]]
2013-01-04    [[1, 3], [1, 12]]
2013-01-05    [[1, 4], [2, 16]]
2013-01-06    [[1, 5], [2, 20]]
Freq: D, dtype: object
                   A         B         C         D             arrays             dframe
2013-01-01  0.271849 -0.777514  0.617433  1.165026   [[1, 0], [0, 0]]   [[1, 0], [0, 0]]
2013-01-02 -0.243265 -0.474100 -0.051533  0.652145   [[1, 1], [0, 4]]   [[1, 1], [0, 4]]
2013-01-03  0.970878 -0.551048 -1.358008  0.706246   [[1, 2], [1, 8]]   [[1, 2], [1, 8]]
2013-01-04  0.086831  1.385965 -0.301682  0.071798  [[1, 3], [1, 12]]  [[1, 3], [1, 12]]
2013-01-05  0.396771 -0.750442  0.025231 -0.961766  [[1, 4], [2, 16]]  [[1, 4], [2, 16]]
2013-01-06  2.467324  1.376785  0.959184 -0.747106  [[1, 5], [2, 20]]  [[1, 5], [2, 20]]
                   A         B         C         D             arrays             dframe
2013-01-01  0.271849 -0.777514  0.617433  1.165026   [[1, 0], [0, 0]]   [

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['arrays', 'dframe']]



Store and recover lists and dicts to/from HDF5 file

In [26]:
lst = [1,2,3]
dct = {'A' : 1,'B' : 4,'C' : 9,'D' : 12}
print(lst)
print(dct)
filename = 'test.hdf5'
store = pd.HDFStore(filename)
store['lst'] = pd.DataFrame(lst)
store['dct'] = pd.DataFrame(dct, index=[0])
store.close()

with pd.HDFStore(filename) as store:
    dlst = store['lst'][0].tolist()
    print(dlst)
    ddct = store['dct'].iloc[0].to_dict()
    print(ddct)

[1, 2, 3]
{'A': 1, 'C': 9, 'B': 4, 'D': 12}
[1, 2, 3]
{'A': 1, 'C': 9, 'B': 4, 'D': 12}


##Dataframe properties

###Row properties

The row count can be obtained in two different forms:

- The `len()` function and the `shape` attribute give the number of rows in the DataFrame, irrespective of the contents of the cells.
- The `df.count()` method returns a pandas Series containing the number of valid (non-NaN) entries in each column.

In [27]:
df = makeNaNdf()
print(df)
print('\nlen(df) = {}'.format(len(df)))
print('\nshape[0] = {}'.format(df.shape[0]))
print('\ntype(df.count()) = {}'.format(type(df.count())))
print('\ndf.count() = \n{}'.format(df.count()))
print("\ndf.count()['A'] = {}".format(df.count()['A']))
print('\ndf.count()[1] = {}'.format(df.count()[1]))
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

   A   B
0  1 NaN
1  3   4
2  4   5

len(df) = 3

shape[0] = 3

type(df.count()) = <class 'pandas.core.series.Series'>

df.count() = 
A    3
B    2
dtype: int64

df.count()['A'] = 3

df.count()[1] = 2

Nans = 
       A      B
0  False   True
1  False  False
2  False  False


The DataFrame index (row names) can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html), which can be used to retrieve a list of the row names:

In [28]:
df = makegridDF()
print(df.index)
print(df.index.tolist())

Int64Index([0, 1, 2], dtype='int64')
[0, 1, 2]


If the DataFrame index is a more complex data type, an index of that type is returned (here a `DatetimeIndex`):

In [29]:
df = makeDateRand()
print(df.index)
print(df.index.tolist()) # returns a list
print(df.index.values) # returns an array

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D', tz=None)
[Timestamp('2013-01-01 00:00:00', offset='D'), Timestamp('2013-01-02 00:00:00', offset='D'), Timestamp('2013-01-03 00:00:00', offset='D'), Timestamp('2013-01-04 00:00:00', offset='D'), Timestamp('2013-01-05 00:00:00', offset='D'), Timestamp('2013-01-06 00:00:00', offset='D')]
['2013-01-01T02:00:00.000000000+0200' '2013-01-02T02:00:00.000000000+0200'
 '2013-01-03T02:00:00.000000000+0200' '2013-01-04T02:00:00.000000000+0200'
 '2013-01-05T02:00:00.000000000+0200' '2013-01-06T02:00:00.000000000+0200']


In [30]:
#set / change the index name
df = makegridDF()
df.index.name = 'MyIndex'
print(df)
print(df.index.tolist()) # returns a list
print(df.index.values)  # returns an array

         A  B  C   D
MyIndex             
0        1  4  7  10
1        2  5  8  11
2        3  6  9  12
[0, 1, 2]
[0 1 2]


Use a column's values to set the index. Note that repeated index values are allowed: selecting a repeated label with `.loc` returns every matching row.

In [31]:
df = makegridDF()
print(df)

df.loc[2,'A'] = 2
df.index = df.A
df.index.name = 'MyNewIndex'
print(df.loc[2,:])


   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
            A  B  C   D
MyNewIndex             
2           2  5  8  11
2           2  6  9  12
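
Since repeated index labels are accepted silently, it can help to check for them explicitly. The sketch below uses a small made-up frame and relies on `Index.is_unique` and `Index.duplicated()`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [4, 5, 6]})
df.index = df['A']            # index now holds the labels 1, 2, 2

print(df.index.is_unique)     # whether every index label occurs only once
print(df.index.duplicated())  # boolean mask marking repeated labels
print(df.loc[2])              # a DataFrame with both rows labelled 2

# Keep only the first row for each index label
deduped = df[~df.index.duplicated()]
print(deduped)
```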


### Column properties

The column names can be retrieved as a [`pandas.Index`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html):

In [32]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

Get a list of the columns in a DataFrame. There are several ways to do this:

In [33]:
df = makeDateRand()
print(list(df.columns.values))
print(df.columns.values.tolist()) #fastest
print(list(df))

['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D']


###DataFrame values

Get the values of the DataFrame contents as a Numpy array:

In [34]:
df = makeDateRand()
df.values

array([[ 0.05819467, -0.66685004, -0.03851065,  0.00923432],
       [ 0.69374828, -0.64996347, -1.91384252,  0.40468667],
       [-0.03553719,  1.64659896,  1.54464038, -0.97632196],
       [ 0.1954416 , -1.69944206,  0.36711453,  1.4470668 ],
       [ 0.09399458,  1.07250138, -0.63992774,  0.41560286],
       [ 0.23351216, -2.00076775,  0.40172087, -0.80809425]])

##NaN in Pandas / Numpy

Empty cells, or cells with missing data are filled with NaNs.  The example shows how to test for NaN values (`isnull()`).

In [35]:
a = np.nan
print(a)
print(pd.isnull(a))

nan
True
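
A dedicated test such as `isnull()` is needed because NaN does not compare equal to anything, including itself. The following sketch illustrates this; the Series is made up for the example.

```python
import numpy as np
import pandas as pd

a = np.nan
print(a == a)           # NaN never compares equal, not even to itself
print(np.isnan(a))      # NumPy test, for numeric values
print(pd.isnull(a))     # pandas test
print(pd.isnull(None))  # pandas also treats None as missing

s = pd.Series([1.0, np.nan, 3.0])
n_missing = s.isnull().sum()  # number of missing entries in the Series
print(n_missing)
```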


Use the `fillna()` function to fill NaN cells with some other value.

In [36]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

#change all NaN to some other value
df.B = df.B.fillna('**')
print('\nNans replaced with ** = \n{}'.format(df))


   A   B
0  1 NaN
1  3   4

Nans = 
       A      B
0  False   True
1  False  False

Nans replaced with ** = 
   A   B
0  1  **
1  3   4


Create a new DataFrame by removing the rows that have NaN in a given column:

In [37]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
df = df[pd.notnull(df['B'])]
print(df)

   A   B
0  1 NaN
1  3   4
   A  B
1  3  4
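
The same row selection can also be written with `dropna()`. This is a sketch on a made-up frame; the `subset` argument limits the NaN check to the listed columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [3, 4], [np.nan, 6]], columns=list('AB'))

by_mask = df[pd.notnull(df['B'])]    # boolean-mask form
by_dropna = df.dropna(subset=['B'])  # same rows, via dropna
no_nan_at_all = df.dropna()          # drop rows with NaN in any column

print(by_dropna)
print(no_nan_at_all)
```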


##Manipulating DataFrames

###View of a DataFrame vs a Copy of a DataFrame

See [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) for the full example.
           
More information on [multi-indexing](http://pandas-docs.github.io/pandas-docs-travis/advanced.html)

In [38]:
dfmi = createMultColIdx()
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


In the code below the first form, `dfmi['one']['second']`, is called chained indexing: it makes two calls to `__getitem__`, one after the other.  The first call, `dfmi['one']`, returns a DataFrame, which is the input to the second call, `(dfmi['one'])['second']`.

The second form, `dfmi.loc[:,('one','second')]`, passes a nested tuple to a single call to `__getitem__`, which can be significantly faster and allows one to index both axes if so desired. Look at the name of the Series returned in both cases and spot the difference.

In [39]:
print(dfmi['one']['second'])
print(dfmi.loc[:,('one','second')])


0    b
1    f
2    j
3    n
Name: second, dtype: object
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object


The first form gives a `SettingWithCopyWarning` when it is used to set values.  Since chained indexing is two calls, either call may return a copy of the data, depending on how it is sliced. When setting, you may thus be setting values on a copy and not on the original frame data.

The `.loc` operation is a single python operation, and thus can select a slice (which still may be a copy), but allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would think.

The reason for having the `SettingWithCopy` warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array.

In [40]:
dfmi['one']['second'] = 1 # assignment has no effect on the original!!
print(dfmi)

    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


To get the desired effect, use `.loc` to directly address the original DataFrame.  The built-in `slice` is used here to select a range of column labels: `slice('one','second')` covers both columns under `'one'`, as the output shows.

<http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers>

In [41]:
dfmi.loc[:,slice('one','second')] = 1
print(dfmi)

    one          two       
  first second first second
0     1      1     c      d
1     1      1     g      h
2     1      1     k      l
3     1      1     o      p
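
To set one specific column of a hierarchical column index, a tuple can be passed to `.loc` in a single call. The sketch below builds a small frame with the same column layout; the replacement values `'X'` and `'Y'` are arbitrary.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame([list('abcd'), list('efgh')], columns=cols)

# A tuple names one column: outer level 'one', inner level 'second'
dfmi.loc[:, ('one', 'second')] = 'X'

# A bare outer label addresses every column under it
dfmi.loc[:, 'two'] = 'Y'
print(dfmi)
```

Both assignments act on the original frame, so no `SettingWithCopyWarning` is involved.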


###Dropping a row from a DataFrame

When using drop(), note the axis direction.   
- to drop a column axis=1  
- to drop a row axis=0  (default)

In [42]:
#drop a column
df = makegridDF()
#first drop the 'A' column
print(df.drop('A',axis=1))

   B  C   D
0  4  7  10
1  5  8  11
2  6  9  12


You can also use row index labels to drop rows.  The following example shows several index-based row drop methods.

In [43]:
df = makegridDF()
print(df)
print(df.drop(1)) # single index
print(df.drop([1,2])) # list of indexes
print(df.drop(0,axis=0)) #single index explicit row selection
print(df.drop(df.index[[0,2]])) #list in the index 
for idx, row in df.iterrows():#iterate over all rows
    print(idx,row)
    df.drop(idx,inplace=True)
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
2  3  6  9  12
   A  B  C   D
0  1  4  7  10
   A  B  C   D
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
(0, A     1
B     4
C     7
D    10
Name: 0, dtype: int64)
(1, A     2
B     5
C     8
D    11
Name: 1, dtype: int64)
(2, A     3
B     6
C     9
D    12
Name: 2, dtype: int64)
Empty DataFrame
Columns: [A, B, C, D]
Index: []


In [44]:
#drop some of the rows
df = makeDateRand()
df = df.drop(df.index[[0,1,2,3]])
for idx, row in df.iterrows():
    print(idx,row)

(Timestamp('2013-01-05 00:00:00', offset='D'), A    0.034409
B    0.564287
C   -0.237843
D   -2.152395
Name: 2013-01-05 00:00:00, dtype: float64)
(Timestamp('2013-01-06 00:00:00', offset='D'), A   -1.605161
B   -0.474956
C    0.099835
D    0.516913
Name: 2013-01-06 00:00:00, dtype: float64)


In [45]:
#drop row based on value in a column
df = makegridDF()
print(df)
print(df[df['A'] >= 3])
print(df[(df['A'] >= 2) & (df['B']<6)])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
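
Boolean selection keeps the rows that match. To drop the matching rows instead, the mask can be negated or the matching index labels can be passed to `drop()`. This is a sketch on a made-up frame.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Rows to remove: those where A >= 2
mask = df['A'] >= 2

kept_by_mask = df[~mask]                # keep the complement of the mask
kept_by_drop = df.drop(df[mask].index)  # or drop the matching index labels

print(kept_by_mask)
```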


###Concatenation or Appending rows or DataFrames

`df.shape[0]` returns the number of rows already in the DataFrame. Because a default integer index is zero-based, using `df.shape[0]` as a row label points to a new row immediately beyond the current last row.  This is an easy way to add row(s) to an existing DataFrame. 

Rows can be added to the DataFrame by [setting with enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement).  The `df.loc[i]` (location) construct points to row `i`, which need not be an existing row.

In [46]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df)
df.loc[df.shape[0]] = ['a','b'] # add a row immediately beyond the current last
df.loc[df.shape[0]] = [np.nan,'new!'] # add a row immediately beyond the current last
print(df)

   A  B
0  1  2
1  3  4
     A     B
0    1     2
1    3     4
2    a     b
3  NaN  new!


Append rows to a DataFrame. See [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation).  This example makes a copy of one of the rows and appends it to the DataFrame.  Note that in this case a `copy()` is required to create a new Series, which is modified before appending.

In [47]:
df = makeRand()
s = df.iloc[2].copy() # copy is required, otherwise a view is taken
s.iloc[2] = 1000 # set the third element (column 'C') by position
print(s)
df.append(s, ignore_index=True)

A       2.362076
B       0.126814
C    1000.000000
D       1.175470
Name: 2, dtype: float64


| |A|B|C|D|
|--|--|--|--|--|
|0|-0.168754|-1.493646|1.414307|-0.521216|
|1|-1.360728|-0.011363|-0.787047|-0.297332|
|2|2.362076|0.126814|-0.17782|1.17547|
|3|2.017522|-0.015841|0.66212|0.476828|
|4|2.362076|0.126814|1000.0|1.17547|
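
In recent pandas releases (2.0 and later) `DataFrame.append` is no longer available, and `pd.concat` does the same job. The sketch below is one possible equivalent on a made-up frame, not the only way to write it.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

s = df.iloc[1].copy()  # copy one row as a Series
s['B'] = 1000.0        # modify the copy, not the original frame

# to_frame().T turns the Series into a one-row DataFrame
df_new = pd.concat([df, s.to_frame().T], ignore_index=True)
print(df_new)
```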


The `append()` function can be used to add one or more rows, formed as DataFrames, to an existing DataFrame.

In [48]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = df.append(df2) # append row(s)
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8


Concatenate some existing rows from the current DataFrame to itself.

In [49]:
df = makeRand()
df2 = pd.concat([df,df[2:4]])
df2

| |A|B|C|D|
|--|--|--|--|--|
|0|0.326437|1.449927|-0.983553|1.130817|
|1|-1.470766|-1.545728|-0.91158|-0.942767|
|2|2.01104|-0.050539|0.327116|-1.256348|
|3|0.342107|0.61546|-0.186546|-0.621641|
|2|2.01104|-0.050539|0.327116|-1.256348|
|3|0.342107|0.61546|-0.186546|-0.621641|


The following example concatenates three views of a DataFrame.

In [50]:
# Concatenating pandas objects together
df = makeRand(10,4)
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)


          A         B         C         D
0 -0.372643  0.674606  1.398682  0.072796
1  1.890054 -2.644753 -0.661703 -1.442987
2 -1.037201  1.382994  2.364432 -0.199727
3 -0.459675  0.395817 -0.021034 -0.388560
4  0.709492 -1.269402 -1.490404  0.585440
5  0.175227 -0.111574 -1.738165 -0.836384
6  1.210979  0.433999  0.074065 -0.611493
7  0.782219  0.922333 -0.209181 -0.576862
8 -0.950578  2.167220  0.546941 -0.339629
9  0.162805 -2.236091 -1.049877 -0.006776


| |A|B|C|D|
|--|--|--|--|--|
|0|-0.372643|0.674606|1.398682|0.072796|
|1|1.890054|-2.644753|-0.661703|-1.442987|
|2|-1.037201|1.382994|2.364432|-0.199727|
|3|-0.459675|0.395817|-0.021034|-0.38856|
|4|0.709492|-1.269402|-1.490404|0.58544|
|5|0.175227|-0.111574|-1.738165|-0.836384|
|6|1.210979|0.433999|0.074065|-0.611493|
|7|0.782219|0.922333|-0.209181|-0.576862|
|8|-0.950578|2.16722|0.546941|-0.339629|
|9|0.162805|-2.236091|-1.049877|-0.006776|


Concatenation stacks together rows from two DataFrames. In the example below the `df` DataFrame is concatenated with a 2x2 slice from `dfr`.  There are two observations from the code below:

- The column names are used when concatenating the rows. In the first example the column names are consistent and the rows are appended as expected. In the second example the column names do not exactly agree, and cells with missing data are filled with NaN.
- The index data type of the concatenated DataFrame must be the same as the main DataFrame (hash error occurs otherwise).  For example if the index is the DateTime series, the concatenation will not work.

In [51]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
dfr = makegridDF()
print(dfr)
print('\nExample 1')
print('to be contatenated={}'.format(dfr.loc[1:2,['A','B']]))
df2 = pd.concat([df,dfr.loc[1:2,['A','B']]])
print(df2)

print('\nExample 2')
print('to be contatenated={}'.format(dfr.loc[1:2,['B','C']]))
df2 = pd.concat([df,dfr.loc[1:2,['B','C']]])
print(df2)

#this will not concatenate in the examples above - index of wrong type
# df = makeDateRand()
# df.loc['20130102':'20130104',['A','B']]

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

Example 1
to be contatenated=   A  B
1  2  5
2  3  6
   A  B
0  1  2
1  3  4
1  2  5
2  3  6

Example 2
to be contatenated=   B  C
1  5  8
2  6  9
    A  B   C
0   1  2 NaN
1   3  4 NaN
1 NaN  5   8
2 NaN  6   9
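
Both results above keep the original row labels, so the label 1 occurs twice. A minimal sketch of one way to avoid this, assuming a fresh integer index is acceptable, is to pass `ignore_index=True` to `pd.concat`. The frame `extra` is a hypothetical stand-in for the `dfr` slice used above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
# hypothetical stand-in for dfr.loc[1:2, ['B', 'C']]
extra = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=[1, 2])

# ignore_index=True discards the incoming labels and renumbers the rows
stacked = pd.concat([df, extra], ignore_index=True)
print(stacked)
```

Columns are still aligned by name, so the missing `A` and `C` cells remain NaN.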


###SQL style merges

See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join).

In [52]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
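
The merge above is an inner join on `key`, and because every `foo` on the left matches every `foo` on the right it yields all four combinations. As a sketch of the other join types, the `how` argument of `pd.merge` selects which keys survive. The frames below are made-up data with only partly overlapping keys.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                  # default how='inner': keys in both
left_join = pd.merge(left, right, on='key', how='left')  # all keys from left
outer = pd.merge(left, right, on='key', how='outer')     # keys from either side

print(inner)
print(left_join)
print(outer)
```

Keys without a partner on the other side get NaN in the columns that come from that side.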


###Handling duplicated rows

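Find duplicated rows with `duplicated`. By default the values in all the columns must match; `subset` restricts the comparison to the listed columns, as in this example.  Either the first or the last row of a duplicated group can be treated as the original, in which case the other rows are marked as duplicates.  The second part of the example creates a new DataFrame containing only the rows marked as duplicates and prints how many there are.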

In [53]:
df2 = makegridDF()
df2.loc[1,'A'] = 1
df2.loc[1,'B'] = 1
df2.loc[0,'B'] = 1
df2['isdup'] = df2.duplicated(subset=['A','B'])
print(df2)
df2['isdup'] = df2.duplicated(subset=['A','B'], take_last=True)
print(df2)

# create a new dataframe with the repeated rows
df = df2[df2.duplicated(subset=['A','B'], take_last=True)] 
print(len(df))
print(df)

   A  B  C   D  isdup
0  1  1  7  10  False
1  1  1  8  11   True
2  3  6  9  12  False
   A  B  C   D  isdup
0  1  1  7  10   True
1  1  1  8  11  False
2  3  6  9  12  False
1
   A  B  C   D isdup
0  1  1  7  10  True
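
In more recent pandas releases the `take_last` argument has been replaced by `keep`, which accepts `'first'`, `'last'` or `False`. The sketch below repeats the marking step under that assumption, building the data directly instead of calling `makegridDF()`.

```python
import pandas as pd

# same values as the modified grid above
df2 = pd.DataFrame({'A': [1, 1, 3], 'B': [1, 1, 6],
                    'C': [7, 8, 9], 'D': [10, 11, 12]})

first = df2.duplicated(subset=['A', 'B'])               # keep='first': later rows are marked
last = df2.duplicated(subset=['A', 'B'], keep='last')   # earlier rows are marked
both = df2.duplicated(subset=['A', 'B'], keep=False)    # every member of the group is marked

print(first.tolist(), last.tolist(), both.tolist())
```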


The next example checks for duplicates in column 'A' only and then deletes these row(s) from the DataFrame.

In [54]:
df2.drop_duplicates(subset=['A'], take_last=True, inplace=True)
df2

   A  B  C   D  isdup
1  1  1  8  11  False
2  3  6  9  12  False


The index value of any arbitrary row can be changed by making a list of the index, changing the value in the list and then re-assigning the list back to the DataFrame.

In [55]:
df2 = makegridDF()
df2.index = df2.index.tolist()[:-1]   + ['New Idx Value']
print(df2)

               A  B  C   D
0              1  4  7  10
1              2  5  8  11
New Idx Value  3  6  9  12
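
A single index label can also be changed with `rename`, which maps old labels to new ones and returns a new DataFrame. This is a sketch using a small hand-built frame in place of `makegridDF()`.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# map the old label 2 to a new label; df2 itself is left unchanged
renamed = df2.rename(index={2: 'New Idx Value'})
print(renamed)
```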


###Transpose a DataFrame

In [56]:
df = makegridDF()
print(df.T)
print(df.index)
print(df.columns)

    0   1   2
A   1   2   3
B   4   5   6
C   7   8   9
D  10  11  12
Int64Index([0, 1, 2], dtype='int64')
Index([u'A', u'B', u'C', u'D'], dtype='object')


###Selecting a subset of columns from a DataFrame

In [57]:
df = makegridDF()
print(df)
print(df.A)
print(df['A'])
print(df[['A','B']])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Selecting a subset of columns may result in a copy or a view of the original DataFrame.  In this example a copy is made when selecting the columns, but a warning ensues when you try to assign a value to an element.  This warning arises because in some cases a view into the original DataFrame is returned and pandas cannot always know which form is used.

<http://stackoverflow.com/questions/11285613/selecting-columns>  

In [58]:
df = makegridDF()
df1 = df[['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


A safer way, which also gets rid of the warning, to build a new DataFrame with a selection of columns is as follows:

In [59]:
df = makegridDF()
df1 = df.loc[:,['A','B']]
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


Another way would be to use `ix`.  However in this case a view is returned, which means that changing `df1` also changes the original DataFrame.

In [60]:
df = makegridDF()
# df1 = df.ix[:,slice('A','B')] # this and the following have same effect.
df1 = df.ix[:,0:2]  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
     A  B  C   D
0    1  4  7  10
1    2  5  8  11
2  100  6  9  12


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


To force a copy of the original, use the `copy` method.

In [61]:
df = makegridDF()
df1 = df.ix[:,slice('A','B')].copy() # this and the following have same effect.
# df1 = df.ix[:,0:2].copy()  # this and the previous have same effect.
print(df1)
df1.loc[2,'A'] = 100
print(df1)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
     A  B
0    1  4
1    2  5
2  100  6
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
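
The same explicit copy can be combined with `.loc`, which avoids `ix` altogether. A minimal sketch, with a hand-built frame standing in for `makegridDF()`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})

# label slice 'A':'B' includes both end columns; copy() detaches the result
df1 = df.loc[:, 'A':'B'].copy()
df1.loc[2, 'A'] = 100

print(df1)
print(df)   # the original is unchanged
```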


###Adding a column to a dataframe

It is relatively easy to add a column to an existing data frame. 

In [62]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01  1.326938  1.344720  0.081126  0.631166  2.752785
2013-01-02 -1.028207 -0.285381  0.007923 -1.012329 -1.305665
2013-01-03  1.612248  2.562048 -0.291928 -0.012301  3.882368
2013-01-04  1.300778  0.149663  1.226523 -0.429995  2.676964
2013-01-05 -0.124523 -0.940318  1.527120  0.351434  0.462279
2013-01-06  0.215885 -0.726950  1.384604  0.577923  0.873539


String concatenation can be used across columns.

In [63]:
df = makegAlphaDF()
print(df)
df2 = df['A'] + df['B']
print(df2)

    A   B   C   D
0  0a  0b  0c  0d
1  1a  1b  1c  1d
2  2a  2b  2c  2d
0    0a0b
1    1a1b
2    2a2b
dtype: object
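
If a separator is wanted between the parts, it can be added as a plain string in the expression, or `Series.str.cat` can be used. A short sketch with made-up data in the same shape as the frame above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['0a', '1a', '2a'], 'B': ['0b', '1b', '2b']})

joined = df['A'] + '-' + df['B']                 # element-wise string addition
joined_cat = df['A'].str.cat(df['B'], sep='-')   # same result via str.cat

print(joined)
```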


###Delete rows based on column value

DataFrame has an `isin` method. When calling `isin`, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

<http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin>  
<http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing>  

In the example below delete all rows where the value in column 'A' is in a given list.

In [64]:
df = makegridDF()
print(df)
idx = df['A'].isin([1,3])
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A  B  C   D
1  2  5  8  11


To match certain values in certain columns, make a dict where the key is the column and the value is a list of items you want to check for.  Combine DataFrame’s `isin()` with the `any()` and `all()` methods to quickly select subsets of your data that meet given criteria, for example to select the rows where each column meets its own criterion.

In the first example, remove all rows where the requirements for __all__ of the tests are met ('A' has 1 or 2, 'B' has 5 or 6, 'C' has 8 and 'D' has 11).

<http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.all.html#numpy.ndarray.all>

In [65]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6], 'C': [8], 'D': [11]})
print(idx)
idx = idx.all(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True   True   True
2  False   True  False  False
0    False
1     True
2    False
dtype: bool
   A  B  C   D
0  1  4  7  10
2  3  6  9  12


In the second example, drop all rows where the requirements for __any__ of the tests are met ('A' has 1 or 2, or 'B' has 5 or 6).  In this case, it would drop all rows from the DataFrame, leaving it empty.

In [66]:
df = makegridDF()
print(df)
idx = df.isin({'A': [1,2], 'B': [5,6]})
print(idx)
idx = idx.any(axis=1)
print(idx)
df = df[~idx]
print(df)

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
       A      B      C      D
0   True  False  False  False
1   True   True  False  False
2  False   True  False  False
0    True
1    True
2    True
dtype: bool
Empty DataFrame
Columns: [A, B, C, D]
Index: []
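
Note that `isin` with a dict returns `False` for every column that is not a key of the dict, as the `C` and `D` columns in the mask above show. Applying `all(axis=1)` to such a mask is therefore always `False`. One way around this, sketched here with a hand-built frame, is to reduce the mask to the tested columns first.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})
criteria = {'A': [1, 2], 'B': [5, 6]}

mask = df.isin(criteria)
all_full = mask.all(axis=1)                    # untested C and D are False, so never True
all_tested = mask[list(criteria)].all(axis=1)  # only the columns named in criteria

print(all_full.tolist())
print(all_tested.tolist())
```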


###Sorting

Sort by column index (axis=1), i.e., rearrange column order.

In [67]:
df = makeDateRand()
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -1.294711  0.415103  0.908869  0.308472
2013-01-02  0.948132  0.160950 -0.478774 -0.419252
2013-01-03 -0.950577  0.211563 -0.047379 -0.704286
2013-01-04  0.256608  0.515264 -2.013660  0.372709
2013-01-05  0.296663  1.133033  0.909311 -1.311231
2013-01-06  0.465990  1.126065 -1.146329 -0.172921


Sort all rows by row value in column B.

In [68]:
df.sort(columns='B')

                   A         B         C         D
2013-01-04  0.372709 -2.013660  0.515264  0.256608
2013-01-06 -0.172921 -1.146329  1.126065  0.465990
2013-01-02 -0.419252 -0.478774  0.160950  0.948132
2013-01-03 -0.704286 -0.047379  0.211563 -0.950577
2013-01-01  0.308472  0.908869  0.415103 -1.294711
2013-01-05 -1.311231  0.909311  1.133033  0.296663


Sort all rows by row value in multiple columns.

In [69]:
df.sort(columns=['B', 'C'])

                   A         B         C         D
2013-01-04  0.372709 -2.013660  0.515264  0.256608
2013-01-06 -0.172921 -1.146329  1.126065  0.465990
2013-01-02 -0.419252 -0.478774  0.160950  0.948132
2013-01-03 -0.704286 -0.047379  0.211563 -0.950577
2013-01-01  0.308472  0.908869  0.415103 -1.294711
2013-01-05 -1.311231  0.909311  1.133033  0.296663


You can introduce custom sorting by using categoricals.  In this example, first sort the 'G' column on default sorting (alphabetical).  Then redefine the 'G' column as a categorical with a specific sort order. Then re-sort, using the categorical sort order.

In [70]:
df = makecatedf()
print(df)
print(df.sort(columns='G'))
      
gsorter = ['fooz','fooa','foox','foon']
df.G = df.G.astype("category")
df.G.cat.set_categories(gsorter, inplace=True) 
print(df.sort(columns='G'))


   A          B  C  D      E    F     G
0  1 2013-01-02  1  3   test  foo  foox
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
1  1 2013-01-02  1  3  train  foo  fooa
2  1 2013-01-02  1  3   test  foo  foon
0  1 2013-01-02  1  3   test  foo  foox
3  1 2013-01-02  1  3  train  foo  fooz
   A          B  C  D      E    F     G
3  1 2013-01-02  1  3  train  foo  fooz
1  1 2013-01-02  1  3  train  foo  fooa
0  1 2013-01-02  1  3   test  foo  foox
2  1 2013-01-02  1  3   test  foo  foon
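
In later pandas releases `DataFrame.sort` has been replaced by `sort_values`, and the `inplace` option of `set_categories` is no longer available. The sketch below shows the same custom sort under those assumptions, using a reduced hypothetical frame with only the `G` column and a helper column `n`.

```python
import pandas as pd

df = pd.DataFrame({'G': ['foox', 'fooa', 'foon', 'fooz'], 'n': [0, 1, 2, 3]})

# default sort is alphabetical
alpha = df.sort_values(by='G')

# an ordered categorical sorts in the order of its categories
gsorter = ['fooz', 'fooa', 'foox', 'foon']
df['G'] = pd.Categorical(df['G'], categories=gsorter, ordered=True)
custom = df.sort_values(by='G')

print(alpha)
print(custom)
```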


###Slicing and selecting sub-arrays

While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, use the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix. 

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)

The [pandas site](http://pandas.pydata.org/pandas-docs/stable/indexing.html) offers the following description:

Object selection has had a number of user-requested additions in order to support more explicit location based indexing. pandas now supports three types of multi-axis indexing.

1.    `.ix` supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. `.ix` is the most general and will support any of the inputs in `.loc` and `.iloc`. `.ix` also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.      However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use `.iloc` or `.loc`.    
       `.ix` does not and cannot guarantee that the label versus integer position resolution is perfect - you may run into [problems](https://github.com/pydata/pandas/issues/6683)  here.  `.ix` is an older method than `.loc` and `.iloc`, which were introduced specifically to prevent this ambiguity by using stricter rules on data selection. `.ix` is faster than `.loc` and `.iloc`.

     See more at [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced) and [Advanced Hierarchical](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-advanced-hierarchical).

1.    `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:
       * A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
       * A list or array of labels ['a', 'b', 'c']
       * A slice object with labels 'a':'f', (note that contrary to usual python slices, **both** the start and the stop are included!)
       * A boolean array

     See more at [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label)

1.     `.iloc` is primarily integer position based (from `0` to `length-1` of the axis), but may also be used with a boolean array. `.iloc`  will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:
       * An integer e.g. 5
       * A list or array of integers [4, 3, 0]
       * A slice object with ints 1:7
       * A boolean array

     See more at [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)


Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but applies to `.iloc` and `.ix` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`. (e.g. `p.loc['a']` is equiv to `p.loc['a', :, :]`)

|Object Type |	Indexers|
|--|--|
|Series 	|`s.loc[indexer]`|
|DataFrame 	|`df.loc[row_indexer,column_indexer]`|
|Panel 	|`p.loc[item_indexer,major_indexer,minor_indexer]`|
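
The difference between label-based and position-based slicing described above can be sketched on a small made-up DataFrame. The frame and its labels are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=list('abcd'), columns=list('XYZ'))

by_label = df.loc['a':'c', 'X':'Y']   # label slices include both endpoints
by_pos = df.iloc[0:2, 0:1]            # positional slices exclude the stop
past_end = df.iloc[2:10]              # out-of-bounds slices are truncated, not an error

missing = False
try:
    df.loc['z']                       # a label that does not exist
except KeyError:
    missing = True

print(by_label.shape, by_pos.shape, len(past_end), missing)
```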


<http://nbviewer.ipython.org/github/gboeing/python-cheat-sheets/blob/master/pandas-selecting.ipynb>  


###Conventional selection by column/index name

Selecting a single column with the form `df['A']` yields a Series, equivalent to `df.A`.  
To select multiple columns, pass a list of column names as in `df[['A','B']]`.

In [71]:
df = makegridDF()
print(df.A)
print(df['A'])
print(df[['A','B']])

0    1
1    2
2    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
   A  B
0  1  4
1  2  5
2  3  6


Extract the Numpy array from the series in one of these two ways:

In [72]:
print(np.asarray(df['A']))
print(df['A'].values)

[1 2 3]
[1 2 3]


Slice rows using `df[]`, with either row numbers or index values.   Both forms use slice notation. With row numbers the upper bound is not included; with index labels, as in the date example further down, both endpoints are included.  

This is the same form as used for columns above - somewhat confusing!

In [73]:
df = makeDateRand()
print(df)
print(df[2:4])

                   A         B         C         D
2013-01-01  0.061052 -1.218434 -0.714600 -0.303349
2013-01-02 -0.856383  0.899420  0.060545  1.360151
2013-01-03 -0.687643 -0.143936 -1.681232  0.442991
2013-01-04  0.288631  0.439466  2.137891  0.778818
2013-01-05 -0.562437 -0.189158 -2.240910 -0.301650
2013-01-06 -0.007988  0.496398  0.651323  1.027091
                   A         B         C         D
2013-01-03 -0.687643 -0.143936 -1.681232  0.442991
2013-01-04  0.288631  0.439466  2.137891  0.778818


In [74]:
print(df['2013-01-01':'2013-01-02'])

                   A         B         C         D
2013-01-01  0.061052 -1.218434 -0.714600 -0.303349
2013-01-02 -0.856383  0.899420  0.060545  1.360151
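
The two slicing behaviours of `df[]` can be compared side by side. This is a sketch with deterministic data standing in for `makeDateRand()`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

pos = df[2:4]                          # rows 2 and 3; the stop position is excluded
lab = df['2013-01-01':'2013-01-02']    # both end labels are included

print(pos)
print(lab)
```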


###[`.ix` Conventional selection by label or position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

You can use `ix` to select slices of the data frame.  

In [75]:
df = makeDateRand()
print(df)
print('')
print(df.ix[:, 'D']) # All rows in column D
print(df.ix[0:2, 0:2]) # upper left 2x2 sub-array, not including third [2] column 
print(df.ix[0:2, [0,2,3]]) # multiple columns in list format
print(df.ix[1:3, 'A':'C']) # use range of column names, note 'C' is included!!
print(df.ix[2:4, ['A','C']]) # use list of column names
print(df.ix[1:3, 'B':]) # All columns onwards from 'B'
print(df.ix[1:3, :'C']) # All columns up to and including!! C
df.ix[1:3, :'C'] = -1
print(df)

                   A         B         C         D
2013-01-01  0.877056  1.162405 -2.024158  0.283611
2013-01-02 -0.251237 -1.470006 -0.191632  0.967703
2013-01-03  1.323086 -1.591721  0.106888  1.317541
2013-01-04 -0.486526 -1.127517 -0.859362  1.885716
2013-01-05  2.089531  0.529507  0.467789 -1.292124
2013-01-06  2.460774  0.444048  0.671873  2.575993

2013-01-01    0.283611
2013-01-02    0.967703
2013-01-03    1.317541
2013-01-04    1.885716
2013-01-05   -1.292124
2013-01-06    2.575993
Freq: D, Name: D, dtype: float64
                   A         B
2013-01-01  0.877056  1.162405
2013-01-02 -0.251237 -1.470006
                   A         C         D
2013-01-01  0.877056 -2.024158  0.283611
2013-01-02 -0.251237 -0.191632  0.967703
                   A         B         C
2013-01-02 -0.251237 -1.470006 -0.191632
2013-01-03  1.323086 -1.591721  0.106888
                   A         C
2013-01-03  1.323086  0.106888
2013-01-04 -0.486526 -0.859362
                   B         C         

Copying discontinuous column ranges takes a bit more effort. First create a list of the required columns.

In [76]:
df = makeDateRand()
lst = list(df.columns[0:1]) + list(df.columns[2:3])
print(lst)
df1 = df[lst].copy() # copy was made, use this to get rid of the warning 
df1.ix[2,0] = +1000
print(df1)

df2 = df.ix[:,lst] # ix appears to have made a copy
df2.ix[2,0] = +1000
print(df2)
print(df)



['A', 'C']
                      A         C
2013-01-01    -0.055810 -1.077925
2013-01-02    -0.133790 -0.463038
2013-01-03  1000.000000 -0.886826
2013-01-04     0.038903  1.175930
2013-01-05     0.325994  0.720854
2013-01-06     1.746072 -0.156422
                      A         C
2013-01-01    -0.055810 -1.077925
2013-01-02    -0.133790 -0.463038
2013-01-03  1000.000000 -0.886826
2013-01-04     0.038903  1.175930
2013-01-05     0.325994  0.720854
2013-01-06     1.746072 -0.156422
                   A         B         C         D
2013-01-01 -0.055810  1.034389 -1.077925 -0.601989
2013-01-02 -0.133790 -0.557676 -0.463038  0.183704
2013-01-03  1.300642  2.900971 -0.886826 -1.305462
2013-01-04  0.038903  1.059843  1.175930 -0.253780
2013-01-05  0.325994  0.231051  0.720854  0.988304
2013-01-06  1.746072 -0.115538 -0.156422  0.238956
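
The `.ix` indexer was deprecated and later removed from pandas, so the selections in this section need `.loc` or `.iloc` on current versions. The following is a sketch of possible replacements for some of the calls above, using deterministic data in place of `makeDateRand()`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

col_d = df.loc[:, 'D']                   # was df.ix[:, 'D']
top_left = df.iloc[0:2, 0:2]             # was df.ix[0:2, 0:2]
picked = df.iloc[0:2, [0, 2, 3]]         # was df.ix[0:2, [0,2,3]]
# rows by position, columns by label: translate the positions to labels first
mixed = df.loc[df.index[1:3], 'A':'C']   # was df.ix[1:3, 'A':'C']

print(mixed)
```

Translating row positions to labels with `df.index[...]` keeps the whole selection label-based, so a single `.loc` call is enough.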


###[`.loc` Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

The following example may look strange in that it refers to the index through a variable (`dates`), although the index itself is not named (`df.index.name` is `None`).  The variable `dates` is simply the DataFrame's `DatetimeIndex`, so `dates[0]` and `df.index[0]` are the same label and `.loc` returns the same row for both.

In [77]:
df = makeDateRand()
dates = df.index
print(df)
print(df.index)
print(df.index.name)
print(df.columns)
print(df.loc[df.index[0]])
print(df.loc[dates[0]])


                   A         B         C         D
2013-01-01  1.127682 -0.655424  1.349564 -1.016169
2013-01-02  1.365243  0.087857  0.665405 -1.257426
2013-01-03 -0.222890 -0.532642  1.250088  0.290987
2013-01-04  1.913217  0.685340  1.620204 -0.128667
2013-01-05  0.883625  0.346807 -2.161959 -0.715071
2013-01-06 -2.649644  0.896102  0.612015 -0.760809
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D', tz=None)
None
Index([u'A', u'B', u'C', u'D'], dtype='object')
A    1.127682
B   -0.655424
C    1.349564
D   -1.016169
Name: 2013-01-01 00:00:00, dtype: float64
A    1.127682
B   -0.655424
C    1.349564
D   -1.016169
Name: 2013-01-01 00:00:00, dtype: float64


In this example the row with index label 0 is accessed with `df.loc[0]`. With the default integer index the label coincides with the row count.  The index is not named.

In [78]:
df = makegridDF()
print(df)
print(df.index)
print(df.index.name)
print(df.loc[0])

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
Int64Index([0, 1, 2], dtype='int64')
None
A     1
B     4
C     7
D    10
Name: 0, dtype: int64


Select all the rows, but only the 'A' and 'B' columns of these rows.

In [79]:
df.loc[:,['A','B']]

   A  B
0  1  4
1  2  5
2  3  6


In the following example a slice is made on both rows and columns.  Note that when using `loc` both endpoints in the row range are returned, but in the `ix` case the upper bound must point to one beyond the end row.

In [80]:
df = makeDateRand()
print(df.loc['20130102':'20130104',['A','B']])
print(df.ix[1:4,['A','B']])

                   A         B
2013-01-02 -0.495179 -0.225223
2013-01-03  0.778994  1.147324
2013-01-04  1.080913  0.591990
                   A         B
2013-01-02 -0.495179 -0.225223
2013-01-03  0.778994  1.147324
2013-01-04  1.080913  0.591990


This example selects a row by using a dynamically generated datetime value.

In [81]:
df = makeDateRand()
df.ix[datetime.datetime(2013,1,2)]

A    1.010345
B    0.762128
C   -0.320250
D   -0.327017
Name: 2013-01-02 00:00:00, dtype: float64

Rows can also be selected by numeric index by using the `irow` method.

In [82]:
df = makeDateRand()
print(df)
print(df.irow(1))
print(df.irow(3))

                   A         B         C         D
2013-01-01 -0.928176 -1.090193 -0.506929  0.223856
2013-01-02 -1.229843 -0.378317 -2.227370  0.440748
2013-01-03 -2.772242  1.553902 -0.384162  1.221619
2013-01-04 -0.049332  2.322150 -1.374806  2.245547
2013-01-05 -0.517569  1.521220  3.043147 -0.401948
2013-01-06  0.133225  1.352921  0.673246 -1.057114
A   -1.229843
B   -0.378317
C   -2.227370
D    0.440748
Name: 2013-01-02 00:00:00, dtype: float64
A   -0.049332
B    2.322150
C   -1.374806
D    2.245547
Name: 2013-01-04 00:00:00, dtype: float64
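
`irow` is no longer available in later pandas releases; `iloc` gives the same positional row access. A short sketch, again with deterministic stand-in data, which also shows the datetime lookup written with `.loc`:

```python
import datetime

import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

row1 = df.iloc[1]                               # second row, by position
same = df.loc[datetime.datetime(2013, 1, 2)]    # the same row, by label

print(row1)
```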


This example iterates over all rows, assigning values to each row during iteration.

In [83]:
df = makeDateRand()
print(df.head())
for i,(idx, row) in enumerate(df.iterrows()):
    row['A'] = 2
    df.ix[idx, 'B'] = i
    df.ix[idx]['C'] = np.sqrt(i)
print(df.head())

                   A         B         C         D
2013-01-01  0.214070 -1.012620 -0.621049 -0.863794
2013-01-02 -0.395997 -0.244937 -0.304091 -0.915254
2013-01-03 -1.897581 -1.197606  0.508874 -0.055936
2013-01-04 -0.393389 -1.240034  1.438438  0.000820
2013-01-05  0.356589 -1.550066 -0.423392  0.170878
            A  B         C         D
2013-01-01  2  0  0.000000 -0.863794
2013-01-02  2  1  1.000000 -0.915254
2013-01-03  2  2  1.414214 -0.055936
2013-01-04  2  3  1.732051  0.000820
2013-01-05  2  4  2.000000  0.170878
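
Whether writing to the `row` returned by `iterrows` changes the DataFrame depends on whether pandas hands back a view or a copy, so it is not something to rely on. Assuming the goal is just the column values produced above, a vectorized sketch avoids the loop entirely:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=5)
df = pd.DataFrame(np.zeros((5, 4)), index=dates, columns=list('ABCD'))

n = np.arange(len(df))   # row counter 0..4, the role of i in the loop above
df['A'] = 2              # scalar is broadcast to every row
df['B'] = n
df['C'] = np.sqrt(n)

print(df)
```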


In [84]:
#tbc

###[`.iloc` Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

The `iloc` indexer selects rows and columns purely by integer position.

In [85]:
df = makeDateRand()
df.iloc[3]

A    0.801783
B    1.170194
C    0.505377
D    0.469312
Name: 2013-01-04 00:00:00, dtype: float64

In [86]:
df.iloc[3:5,0:2]

                   A         B
2013-01-04  0.801783  1.170194
2013-01-05 -2.257584  0.787157


In [87]:
df.iloc[[1,2,4],[0,2]]

                   A         C
2013-01-02 -0.585347  0.153459
2013-01-03  1.456779  1.609178
2013-01-05 -2.257584 -0.466139


In [88]:
#slicing rows
df.iloc[1:3,:]

                   A         B         C         D
2013-01-02 -0.585347  0.085065  0.153459  0.163826
2013-01-03  1.456779 -0.770183  1.609178 -0.378438


In [89]:
#slicing columns
df.iloc[:,1:3]

                   B         C
2013-01-01  0.850289 -0.393127
2013-01-02  0.085065  0.153459
2013-01-03 -0.770183  1.609178
2013-01-04  1.170194  0.505377
2013-01-05  0.787157 -0.466139
2013-01-06 -0.473635  1.200789


In [90]:
df.iloc[1,1]

0.085065235275514364

###Series/DataFrame [enlargement](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#setting-with-enlargement)

The `.loc`/`.ix`/`[]` operations can perform enlargement when setting a non-existent key for that axis. In the Series case this is effectively an appending operation.

In [91]:
se = pd.Series([1,2,3])
print(se) 
se[5] = 5.
print(se)

0    1
1    2
2    3
dtype: int64
0    1
1    2
2    3
5    5
dtype: float64


A DataFrame can be enlarged on either axis via `.loc`

In [92]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
print(dfi)
dfi.loc[:,'C'] = dfi.loc[:,'A']
print(dfi)
dfi.loc[3] = 5
print(dfi)

   A  B
0  0  1
1  2  3
2  4  5
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5


###Find row where index is nearest to given value

In [93]:
df = makeDateRand()
print(df)
print(df.iloc[np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))]) # row
print(np.argmin(np.abs(df.index.to_pydatetime() - datetime.datetime(2013,1,4)))) # index

                   A         B         C         D
2013-01-01  0.828168  0.786836 -1.263948 -0.208430
2013-01-02  0.477568  0.503814  0.569845  1.401609
2013-01-03 -0.865650 -0.846171  1.848966  1.780703
2013-01-04 -2.208243 -1.413001  0.460357 -0.863078
2013-01-05 -0.382128 -0.063438 -0.316363  0.439247
2013-01-06  0.535648 -0.920858 -0.616101  1.498911
A   -2.208243
B   -1.413001
C    0.460357
D   -0.863078
Name: 2013-01-04 00:00:00, dtype: float64
3


In [94]:
df = makeRand()
print(df)
row = df.iloc[np.argmin(np.abs(df.index - 2))] # row
print(type(row))
print(row)
print(np.argmin(np.abs(df.index - 2))) # index

          A         B         C         D
0 -1.372036  0.939721  2.353927  0.462340
1  0.733601  1.806252  0.512340 -0.651053
2  1.148065 -1.274966  0.667707 -1.112733
3  0.383228 -0.875551  0.409569 -2.289556
<class 'pandas.core.series.Series'>
A    1.148065
B   -1.274966
C    0.667707
D   -1.112733
Name: 2, dtype: float64
2
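
For a sorted index, `Index.get_indexer` with `method='nearest'` offers another way to find the closest position. This is a sketch under the assumption that the index is monotonic; the target value is an arbitrary example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

target = pd.Timestamp('2013-01-04 05:00')
# get_indexer returns an array of positions, one per requested target
pos = df.index.get_indexer([target], method='nearest')[0]
row = df.iloc[pos]

print(pos)
print(row)
```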


###Find row where column is maximum

In [95]:
df = makeRand()
print(df)
print(df['A'].argmax())  # index of the maximum in column 'A'
print(df.iloc[df['A'].argmax()]) #row

          A         B         C         D
0  0.162660  1.720285  0.454165 -0.623256
1 -1.106578  0.134310 -2.035434 -0.752612
2  0.675271  0.266502 -0.504802  0.102842
3 -0.478082  0.008230 -1.096827  0.444209
2
A    0.675271
B    0.266502
C   -0.504802
D    0.102842
Name: 2, dtype: float64
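
When the index is not the default integer range, label and position differ. As a sketch, `idxmax` returns the index label of the maximum (for use with `.loc`), while NumPy's `argmax` on the underlying values returns the position (for use with `.iloc`). The data is made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [0.1, -1.1, 0.7, -0.5], 'B': [1, 2, 3, 4]},
                  index=list('wxyz'))

label = df['A'].idxmax()         # index label of the largest value in A
pos = df['A'].values.argmax()    # integer position of the same value

print(label, pos)
print(df.loc[label])
```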


###Find row where specific column has nearest value

In [96]:
df = makeRand()
print(df)
value = 0
print(df.iloc[np.argmin(np.abs(df['A'] - value))]) # row
print(np.argmin(np.abs(df['A'] - value))) # index

          A         B         C         D
0 -0.518847  0.075705  0.713938 -0.168303
1  0.293792  2.167528 -0.625907 -0.032989
2 -0.850298 -1.565465 -0.359839 -1.147710
3 -0.714039  0.572752 -0.518094 -0.153061
A    0.293792
B    2.167528
C   -0.625907
D   -0.032989
Name: 1, dtype: float64
1
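
The same lookup can be written with pandas methods only, which returns the index label rather than a position. A sketch with made-up data and a non-integer index:

```python
import pandas as pd

df = pd.DataFrame({'A': [-0.5, 0.3, -0.9, -0.7]}, index=list('wxyz'))
value = 0

# absolute distance to the target, then the label where it is smallest
label = (df['A'] - value).abs().idxmin()
row = df.loc[label]

print(label)
print(row)
```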


###Boolean indexing and filtering

In [97]:
#filter rows by a condition on a single column
df = makeDateRand()
df[df.A > 0]

                   A         B         C         D
2013-01-04  1.072648 -0.799697  0.203440  0.776776
2013-01-05  1.124499 -0.627949  0.762926 -0.140655
2013-01-06  0.455861  0.095687 -1.244971 -0.752313


In [98]:
#filter rows by conditions on multiple columns
df2 = df[(df.A>0) & (df.B>0)]
df2

                   A         B         C         D
2013-01-06  0.455861  0.095687 -1.244971 -0.752313


In [99]:
#filter specs are boolean pandas Series, which can be manipulated
filt = (df.A>0) & (df.B>0)
print(type(filt), filt)
print('filt.any() = {}'.format(filt.any()))
print('filt.all() = {}'.format(filt.all()))

(<class 'pandas.core.series.Series'>, 2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06     True
Freq: D, dtype: bool)
filt.any() = True
filt.all() = False
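
Boolean filters combine with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0, 3.0, -4.0],
                   'B': [1.0, -2.0, 3.0, -4.0]})

both = df[(df.A > 0) & (df.B > 0)]         # rows where both conditions hold
either = df[(df.A > 0) | (df.B > 0)]       # rows where at least one holds
neither = df[~((df.A > 0) | (df.B > 0))]   # rows where none holds

print(both)
print(either)
print(neither)
```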


In [100]:
#filter by element
df[df > 0]

                   A         B         C         D
2013-01-01       NaN       NaN       NaN  0.188210
2013-01-02       NaN  0.540770       NaN       NaN
2013-01-03       NaN       NaN  0.511449       NaN
2013-01-04  1.072648       NaN  0.203440  0.776776
2013-01-05  1.124499       NaN  0.762926       NaN
2013-01-06  0.455861  0.095687       NaN       NaN


In [101]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01 -0.995301 -0.873578 -0.316951  0.188210    one
2013-01-02 -0.706317  0.540770 -0.575105 -0.847466    one
2013-01-03 -0.068224 -0.334110  0.511449 -0.256995    two
2013-01-04  1.072648 -0.799697  0.203440  0.776776  three
2013-01-05  1.124499 -0.627949  0.762926 -0.140655   four
2013-01-06  0.455861  0.095687 -1.244971 -0.752313  three


                   A         B         C         D     E
2013-01-03 -0.068224 -0.334110  0.511449 -0.256995   two
2013-01-05  1.124499 -0.627949  0.762926 -0.140655  four


In [102]:
#get unique values in a column
df = makefoobar()
print(df)
df.B.unique()

     A      B         C         D
0  foo    one -0.211789 -0.151506
1  bar    one -0.407849  2.232608
2  foo    two -1.873665  0.211750
3  bar  three -0.551327 -0.441149
4  foo    two -1.972658  0.970826
5  bar    two -0.574543 -0.292994
6  foo    one -1.033056 -1.305929
7  foo  three  1.645788 -1.033893


array(['one', 'two', 'three'], dtype=object)

<http://stackoverflow.com/questions/20875140/apply-function-to-sets-of-columns-in-pandas-looping-over-entire-data-frame-co>  

The question linked above asks how to calculate the length of the vector under each header (A and B in this case) for each index value, and divide it by the Time column. The function is therefore np.sqrt(A_x^2 + A_y^2 + A_z^2)/Time, and the same for B. In other words, a velocity is calculated for each row, with three columns contributing to each velocity result.

In [103]:
#pandas approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

#filter the column names to get a list of the ones you need
print([c for c in df.columns if c.startswith("A_")])

#get the columns according to names
print(df[[c for c in df.columns if c.startswith("A_")]])

# vector length (square root of the dot product) for each row across columns
column_initials = ["A","B"]
for column_initial in column_initials:
    cols = [c for c in df.columns if c.startswith(column_initial+"_")]
    df["Velocity_"+column_initial] = df[cols].apply(lambda x: np.sqrt(x.dot(x)), axis=1)/df.Time
print(df)  


['A_x', 'A_y', 'A_z']
         A_x       A_y       A_z
1  -0.466980  1.732722  0.944351
2   0.972732  0.984821  0.277257
3  -1.110719 -0.291709 -0.664218
4  -0.999865 -0.142255 -2.245898
5   1.248955 -0.043165 -0.098336
6   0.133296 -0.050954 -1.620492
7  -1.401393  1.276168 -2.296349
8  -0.006538  0.488665  0.737432
9  -1.106685  0.043987 -0.038015
10  1.903857 -0.842152 -0.379504
        Time       A_x       A_y       A_z       B_x       B_y       B_z  Velocity_A  Velocity_B
1   1.348334 -0.466980  1.732722  0.944351  0.520680 -0.123358  0.549939    1.503971    0.569077
2   0.713104  0.972732  0.984821  0.277257  0.539678  0.173024 -1.226097    1.979681    1.894172
3   0.431897 -1.110719 -0.291709 -0.664218 -0.017614  0.798333  0.556161    3.071662    2.253128
4   0.905709 -0.999865 -0.142255 -2.245898  0.315103 -0.374617  0.246283    2.718890    0.605030
5  -1.888020  1.248955 -0.043165 -0.098336 -1.254201  0.898557 -0.194119   -0.663957   -0.823628
6   2.234962  0.133296 -0.050954 

In [104]:
#numpy approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)

   Velocity_A  Velocity_B
0  176.289366  289.079441
1    0.682697    0.870937
2   -2.013530   -2.167332
3   -0.327317   -1.519677
4    3.551739    1.916660
5    2.154119    1.843357
6   -2.299107   -0.679189
7    3.211561    3.722065
8    2.906667    3.054179
9   -2.755288   -2.863120


In [105]:
# yet another approach
headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)

result = df\
    .loc[:, df.columns!='Time']\
    .groupby(lambda x: x[0], axis=1)\
    .apply(lambda x: np.sqrt((x**2).sum(1)))\
    .apply(lambda x: x / df['Time'])

print(result)

            A          B
1   -4.451987  -2.753416
2    1.953748   2.402162
3   -1.437867  -1.180572
4    0.824325   1.335514
5  -20.249689 -16.340709
6  -29.945239 -23.319525
7    0.574743   1.195075
8    0.708248   0.479328
9   -7.005165  -7.438826
10  -1.849556  -2.245327
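
The three approaches above all pick out groups of columns by their name prefix. A further sketch, assuming the same `A_`/`B_` naming scheme, uses `DataFrame.filter` with a regular expression to select each group. The seeded generator and the strictly positive `Time` column are assumptions made here only to keep the example reproducible.

```python
import numpy as np
import pandas as pd

headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
rng = np.random.RandomState(0)
data = rng.randn(10, 7)
# Assumption for this sketch: keep Time strictly positive
data[:, 0] = np.abs(data[:, 0]) + 1.0
df = pd.DataFrame(data, index=range(1, 11), columns=headers)

speeds = {}
for prefix in ['A', 'B']:
    # filter(regex=...) keeps the columns whose names match the pattern
    block = df.filter(regex='^' + prefix + '_')
    # Euclidean norm of each row, divided by the time in that row
    speeds['Velocity_' + prefix] = np.sqrt((block ** 2).sum(axis=1)) / df['Time']

result = pd.DataFrame(speeds)
print(result)
```

Because the result is built from Series that share the index of `df`, the row labels 1 to 10 are preserved.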


###Setting data

In [106]:
#Adding the sum along a column
df = makeDateRand()
df['A'].sum(), df['B'].sum(), df['C'].sum(), 

(-3.780464436670412, 2.896084492571287, -2.0088581321148458)

In [107]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01 -0.393262  0.240626 -0.495570 -0.397892 -0.648206
2013-01-02  0.324335  0.743724  1.418048  0.657281  2.486107
2013-01-03 -1.446628  0.348666  0.388317  0.279226 -0.709646
2013-01-04 -0.629096 -1.216620  0.304627  0.087036 -1.541089
2013-01-05  0.845574  2.241259 -0.258002  1.537793  2.828831
2013-01-06 -1.467640 -1.330615  0.748392 -2.499363 -2.049863


In [108]:
sum_row = df[['A','B','Total']].sum()
sum_row

A       -2.766717
B        1.027040
Total    0.366134
dtype: float64

We need to convert the Series to a DataFrame and transpose it so that it is easier to concat onto our existing data. The T attribute switches the data from being column-based to row-based.



In [109]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

          A        B     Total
0 -2.766717  1.02704  0.366134


The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.


In [110]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

          A        B   C   D     Total
0 -2.766717  1.02704 NaN NaN  0.366134


Now append the totals to the end of the DataFrame and rename the last index value to 'Total'.

In [111]:
df=pd.concat([df,df_sum],ignore_index=True)
df.index = df.index.tolist()[:-1]   + ['Total']
df.tail()

Unnamed: 0,A,B,C,D,Total
2,-1.446628,0.348666,0.388317,0.279226,-0.709646
3,-0.629096,-1.21662,0.304627,0.087036,-1.541089
4,0.845574,2.241259,-0.258002,1.537793,2.828831
5,-1.46764,-1.330615,0.748392,-2.499363,-2.049863
Total,-2.766717,1.02704,,,0.366134
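
The same totals row can also be added by assigning to a new row label with `.loc`. This is a minimal sketch on small made-up data (the values are assumptions chosen so the sums are easy to check), not the approach used in the referenced article.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': [0.5, 0.5, 0.5]})
df['Total'] = df['A'] + df['B'] + df['C']

# Sum only the columns that should carry a total
totals = df[['A', 'B', 'Total']].sum()

# Reindex to the full column list so the untouched column 'C' becomes NaN,
# then assign the result to a new row label
df.loc['Total'] = totals.reindex(df.columns)
print(df)
```

Assigning to a label that does not yet exist enlarges the DataFrame, so no separate concatenation or index renaming step is needed.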


Setting a new column automatically aligns the data by index. Labels that are missing from the new Series are filled with NaN.

In [112]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,-0.995301,-0.873578,-0.316951,0.18821,one,
2013-01-02,-0.706317,0.54077,-0.575105,-0.847466,one,1.0
2013-01-03,-0.068224,-0.33411,0.511449,-0.256995,two,2.0
2013-01-04,1.072648,-0.799697,0.20344,0.776776,three,3.0
2013-01-05,1.124499,-0.627949,0.762926,-0.140655,four,4.0
2013-01-06,0.455861,0.095687,-1.244971,-0.752313,three,5.0


In [113]:
# Setting values by label
dates = df.index
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D,Total
0,0.0,0.240626,-0.49557,-0.397892,-0.648206
1,0.324335,0.743724,1.418048,0.657281,2.486107
2,-1.446628,0.348666,0.388317,0.279226,-0.709646
3,-0.629096,-1.21662,0.304627,0.087036,-1.541089
4,0.845574,2.241259,-0.258002,1.537793,2.828831
5,-1.46764,-1.330615,0.748392,-2.499363,-2.049863
Total,-2.766717,1.02704,,,0.366134


In [114]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D,Total
0,0.0,7.0,-0.49557,-0.397892,-0.648206
1,0.324335,0.743724,1.418048,0.657281,2.486107
2,-1.446628,0.348666,0.388317,0.279226,-0.709646
3,-0.629096,-1.21662,0.304627,0.087036,-1.541089
4,0.845574,2.241259,-0.258002,1.537793,2.828831
5,-1.46764,-1.330615,0.748392,-2.499363,-2.049863
Total,-2.766717,1.02704,,,0.366134


In [115]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,Total
0,0.0,7.0,-0.49557,5,-0.648206
1,0.324335,0.743724,1.418048,5,2.486107
2,-1.446628,0.348666,0.388317,5,-0.709646
3,-0.629096,-1.21662,0.304627,5,-1.541089
4,0.845574,2.241259,-0.258002,5,2.828831
5,-1.46764,-1.330615,0.748392,5,-2.049863
Total,-2.766717,1.02704,,5,0.366134


In [116]:
# A where operation with setting.
df = makeDateRand()
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,-3.29863,-0.657871,-0.145908,-0.250304
2013-01-02,-1.802052,-0.132991,-0.268343,-0.48568
2013-01-03,-0.149596,-1.955421,-1.224244,-2.263182
2013-01-04,-1.005985,-0.673704,-0.00546,-1.473756
2013-01-05,-1.154524,-0.233439,-1.837678,-0.310832
2013-01-06,-0.439317,-0.852414,-1.352998,-0.057111
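
The boolean-mask assignment above can also be written without modifying a copy in place. The sketch below uses `DataFrame.mask` and `DataFrame.where` on small made-up data (an assumption, chosen so the result is easy to verify).

```python
import pandas as pd

df = pd.DataFrame({'A': [1.5, -2.0, 0.0], 'B': [-0.5, 3.0, -1.0]})

# mask() replaces values where the condition is True ...
negated = df.mask(df > 0, -df)
# ... while where() keeps values where the condition is True
same = df.where(df <= 0, -df)

print(negated)
```

Both calls return a new DataFrame and leave `df` unchanged.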


##Missing data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data).
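
As a small illustration of how missing values are left out of computations, the sketch below compares reductions with and without the `skipna` option on a made-up Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.sum())              # NaN is skipped
print(s.mean())             # mean of the two non-missing values
print(s.sum(skipna=False))  # NaN propagates when skipping is turned off
print(s.count())            # number of non-missing values
```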

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [117]:
df = makeDateRand()
dates = df.index
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
print(df1)
df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)


                   A         B         C         D   E
2013-01-01 -0.698407 -0.967118 -1.308633 -0.251677 NaN
2013-01-02  0.095644 -1.359620 -0.200743  0.425139 NaN
2013-01-03 -0.189154  0.641935  0.260038  1.381564 NaN
2013-01-04  1.811386  0.038837 -0.282623  0.699545 NaN
                   A         B         C         D   E
2013-01-01 -0.698407 -0.967118 -1.308633 -0.251677   1
2013-01-02  0.095644 -1.359620 -0.200743  0.425139   1
2013-01-03 -0.189154  0.641935  0.260038  1.381564 NaN
2013-01-04  1.811386  0.038837 -0.282623  0.699545 NaN


Drop any rows that have missing data:

In [118]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.698407,-0.967118,-1.308633,-0.251677,1
2013-01-02,0.095644,-1.35962,-0.200743,0.425139,1


Filling missing data

In [119]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.698407,-0.967118,-1.308633,-0.251677,1
2013-01-02,0.095644,-1.35962,-0.200743,0.425139,1
2013-01-03,-0.189154,0.641935,0.260038,1.381564,5
2013-01-04,1.811386,0.038837,-0.282623,0.699545,5


Get the boolean mask of where values are NaN:

In [120]:
pd.isnull(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [121]:
df.mean()

A    0.022307
B   -0.351439
C    0.062180
D    0.574610
dtype: float64

In [122]:
#along the other axis
df.mean(1)

2013-01-01   -0.806459
2013-01-02   -0.259895
2013-01-03    0.523596
2013-01-04    0.566786
2013-01-05    0.475396
2013-01-06   -0.037937
Freq: D, dtype: float64
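
Binary operations between two pandas objects align on their labels before computing. The sketch below, on made-up Series, contrasts the `+` operator with the `add` method and its `fill_value` argument.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# The + operator aligns on the index; a label present on one side only gives NaN
plain = a + b
# add() accepts fill_value to stand in for the missing side
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```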

###Applying functions to the data

When using apply(), note the axis direction:   
- axis=0 (default): the function is applied to each column, working down the rows.
- axis=1: the function is applied to each row, working across the columns.

In [123]:
df = makegridDF()
print(df)
print(df.apply(np.cumsum))
print(df.apply(np.cumsum, axis=0))
print(df.apply(np.cumsum, axis=1))

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A  B   C   D
0  1  5  12  22
1  2  7  15  26
2  3  9  18  30


In [124]:
df = makeDateRand()
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,1.260376,1.129251,-0.320283,0.000156
2013-01-02,3.483878,0.289934,0.515737,-1.029528
2013-01-03,4.273068,0.638739,-1.05564,-1.184825
2013-01-04,5.434097,-0.686311,0.971334,-3.485165
2013-01-05,5.554757,0.658838,-0.596334,-3.737291
2013-01-06,5.959648,-0.397774,-1.035941,-2.120366


In [125]:
df = makeDateRand()
print(df)
df.apply(lambda x: x.max() - x.min())

                   A         B         C         D
2013-01-01  0.577599 -0.751487  0.863941 -0.282970
2013-01-02  0.371439 -0.662817  2.177450 -0.785591
2013-01-03  2.361543 -0.737656  0.483291 -2.026924
2013-01-04  0.322357 -0.099805  1.148007 -1.245906
2013-01-05  1.461775 -1.306538  0.302683  1.775165
2013-01-06 -0.564376  0.602491  2.037101 -0.092330


A    2.925919
B    1.909029
C    1.874766
D    3.802089
dtype: float64

In [126]:
from datetime import datetime
df = makeDateRand()
df.index.name = 'Date'
df.reset_index(level=0,inplace=True)
print(df)
#convert to string format
df.Date = df.Date.apply(lambda d: ' '.join(d.isoformat().split('T')))
print(df)
#convert back to datetime format
df.Date = df.Date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S"))
print(df)

df.index = df.Date
print(df)

        Date         A         B         C         D
0 2013-01-01 -1.059584 -0.842404 -2.127777  0.324074
1 2013-01-02 -0.634955  1.355091  0.369171 -0.352802
2 2013-01-03 -1.363172  1.567756  1.294977  0.973030
3 2013-01-04 -1.061412  0.018411  1.313773  0.899181
4 2013-01-05  1.188680 -0.236519 -0.879234  0.197410
5 2013-01-06 -0.691675  1.392524  0.200466  0.320492
                  Date         A         B         C         D
0  2013-01-01 00:00:00 -1.059584 -0.842404 -2.127777  0.324074
1  2013-01-02 00:00:00 -0.634955  1.355091  0.369171 -0.352802
2  2013-01-03 00:00:00 -1.363172  1.567756  1.294977  0.973030
3  2013-01-04 00:00:00 -1.061412  0.018411  1.313773  0.899181
4  2013-01-05 00:00:00  1.188680 -0.236519 -0.879234  0.197410
5  2013-01-06 00:00:00 -0.691675  1.392524  0.200466  0.320492
        Date         A         B         C         D
0 2013-01-01 -1.059584 -0.842404 -2.127777  0.324074
1 2013-01-02 -0.634955  1.355091  0.369171 -0.352802
2 2013-01-03 -1.363172  1.567
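
The round trip between datetime values and strings can also be done on the whole column at once. This is a sketch using the `.dt` accessor and `pd.to_datetime` on a small made-up frame, as an alternative to applying a lambda to each element.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=3),
                   'A': [1.0, 2.0, 3.0]})

# datetime -> string for the whole column
as_text = df['Date'].dt.strftime('%Y-%m-%d %H:%M:%S')

# string -> datetime, again without a Python-level loop
as_dates = pd.to_datetime(as_text, format='%Y-%m-%d %H:%M:%S')

print(as_text)
print(as_dates)
```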

Applying a function using column data, but with extra parameters. In the example below we use the values in a single DataFrame column 'IrradianceLux', together with extra parameters, to calculate new columns.

<http://stackoverflow.com/questions/21188504/python-pandas-apply-a-function-with-arguments-to-a-series-update>

In [127]:
import pandas as pd
lx = {'Sunlight': 107527, 
      'Full daylight': 10752,
      'Overcast day':1075,
      'Very dark day':107,
      'Twilight': 10.8,
      'Deep twilight': 1.08,
      'Full moon': 0.108,
      'Quarter moon':0.0108,
      'Starlight': 0.0011,
      'Overcastnight':0.0001
    }
fnos = [1.4, 2, 2.74, 3.8, 5.4, 7.5
       ]

def calcIrrad(lx, rho, taua, tauo, fno):
    return lx * rho * taua * tauo / (4 * fno ** 2)
    
df = pd.DataFrame(list(lx.items()), columns=['Condition','IrradianceLux'])

rho = 0.3
taua = 0.5
tauo = 0.9
for fno in fnos:
    df['{}'.format(fno)] = df.IrradianceLux.apply(calcIrrad, args=(rho, taua, tauo, fno) )
    
df.sort_values('IrradianceLux')

Unnamed: 0,Condition,IrradianceLux,1.4,2,2.74,3.8,5.4,7.5
8,Overcastnight,0.0001,2e-06,8.4375e-07,4.495445e-07,2.337258e-07,1.157407e-07,6e-08
2,Starlight,0.0011,1.9e-05,9.28125e-06,4.944989e-06,2.570983e-06,1.273148e-06,6.6e-07
4,Quarter moon,0.0108,0.000186,9.1125e-05,4.85508e-05,2.524238e-05,1.25e-05,6.48e-06
6,Full moon,0.108,0.00186,0.00091125,0.000485508,0.0002524238,0.000125,6.48e-05
3,Deep twilight,1.08,0.018597,0.0091125,0.00485508,0.002524238,0.00125,0.000648
7,Twilight,10.8,0.185969,0.091125,0.0485508,0.02524238,0.0125,0.00648
1,Very dark day,107.0,1.842474,0.9028125,0.4810126,0.2500866,0.1238426,0.0642
0,Overcast day,1075.0,18.510842,9.070312,4.832603,2.512552,1.244213,0.645
5,Full daylight,10752.0,185.142857,90.72,48.33502,25.13019,12.44444,6.4512
9,Sunlight,107527.0,1851.549107,907.2591,483.3817,251.3183,124.4525,64.5162


Apply a function to existing columns to create a new column

In [128]:
def fx(x, y):
    return x*y

import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
df['nocols'] = np.vectorize(fx)(4, 5)
df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
df

Unnamed: 0,A,B,nocols,new_column
0,10,20,20,200
1,20,30,20,600
2,30,10,20,300
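
When the function body consists only of arithmetic that pandas already supports, it can usually be called on the columns directly. The sketch below compares that with `np.vectorize`, which the NumPy documentation describes as a convenience that is essentially a loop.

```python
import numpy as np
import pandas as pd

def fx(x, y):
    return x * y

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

# Element-by-element call through np.vectorize (returns a NumPy array)
via_vectorize = np.vectorize(fx)(df['A'], df['B'])
# Direct call: the * operator works on whole Series (returns a Series)
direct = fx(df['A'], df['B'])

print(direct)
```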


Apply a named function on multiple columns and scalar arguments

In [129]:
import pandas as pd 
data = {'gene':['a','b','c','d','e'],
        'count':[61,320,34,14,33],
        'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]

def calculate_RPKM(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df

Unnamed: 0,gene,count,gene_length,rpkm
0,a,61,152,32508.366908
1,b,320,86,301411.926493
2,c,34,92,29936.429112
3,d,14,170,6670.955138
4,e,33,111,24082.405613


Use apply to return multiple columns

<http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns>

In [130]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})), 
    left_index=True, right_index=True)

Unnamed: 0,textcol,feature1,feature2
0,0.839979,1.839979,-0.160021
1,0.949421,1.949421,-0.050579
2,0.585679,1.585679,-0.414321
3,0.208017,1.208017,-0.791983
4,0.849532,1.849532,-0.150468


Return multiple columns, operating on a single column, with additional arguments

In [131]:
def myfunc(s,a1, a2):
    return pd.Series({'feature1':s+a1, 'feature2':s+a2})
    
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df.merge(df.textcol.apply(myfunc,args=(+2, -5)), 
    left_index=True, right_index=True)    
    

Unnamed: 0,textcol,feature1,feature2
0,0.14462,2.14462,-4.85538
1,0.337051,2.337051,-4.662949
2,0.82207,2.82207,-4.17793
3,0.227358,2.227358,-4.772642
4,0.514909,2.514909,-4.485091
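
When each new column is a simple expression of an existing column, `DataFrame.assign` is a possible alternative to `apply` plus `merge`. The sketch below uses fixed made-up values rather than random ones so the result can be checked.

```python
import pandas as pd

df = pd.DataFrame({'textcol': [0.25, 0.5, 0.75]})

# assign() returns a new DataFrame with the extra columns; df is unchanged
out = df.assign(feature1=df['textcol'] + 2,
                feature2=df['textcol'] - 5)
print(out)
```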


The following example calculates the angle between two normal vectors, where the two vectors are given in separate tables. The common value whereby the two tables are joined is given in the 'Key' column in each of the tables.

In [132]:
from numpy import linalg as LA

def normCols(df, lst):
    """Normalise the columns in df, as listed in lst"""
    #get the vector length
    df['norm'] = (LA.norm(df[lst],axis=1))
    #normalise cols in list and return
    df[lst] = df[lst].divide(df['norm'], axis=0)
    df.drop('norm',axis=1,inplace=True)
    #df is modified in place, so the caller sees the normalised columns; nothing is returned
    return 

#create the data: [key, x, y, z]
lstA = [['a',1,0,0],['b',1,0,0],['c', 1,0,0],['d',-4.164548,2.835452,0.835452],['e',-4.164548,2.835452,0.835452]]
lstB = [['a',1,0,0],['b',0,1,0],['c',-1,0,0],['d',-4.164548,2.835452,0.835452],['e',-3.164548,1.835452,2.835452]]

#make dataframes
vec = ['x','y','z']
cols = ['Key'] + vec
dfA = pd.DataFrame(lstA,columns=cols)
dfB = pd.DataFrame(lstB,columns=cols)

#normalise vectors
normCols(dfA, vec)
normCols(dfB, vec)

#join the two vectors on the Key column
dfA.reset_index(inplace=True)
dfB.reset_index(inplace=True)
suffixes = ['a','b'] # used in labelling duplicate column names
dfM = pd.merge(dfA, dfB, left_on=['Key'],right_on=['Key'], how='inner', suffixes=suffixes)  

#calc angle between vectors
va = ['xa','ya','za']
vb = ['xb','yb','zb']
# inner() calculates Va x Vb.T, on which the diagonal() is 
# the dot product of the respective rowA and rowB.T vectors
dfM['angle'] = np.arccos(np.diagonal(np.inner(dfM[va],dfM[vb])))

print(dfM)

   indexa Key        xa        ya       za  indexb        xb        yb        zb         angle
0       0   a  1.000000  0.000000  0.00000       0  1.000000  0.000000  0.000000  0.000000e+00
1       1   b  1.000000  0.000000  0.00000       1  0.000000  1.000000  0.000000  1.570796e+00
2       2   c  1.000000  0.000000  0.00000       2 -1.000000  0.000000  0.000000  3.141593e+00
3       3   d -0.815462  0.555211  0.16359       3 -0.815462  0.555211  0.163590  1.490116e-08
4       4   e -0.815462  0.555211  0.16359       4 -0.683709  0.396554  0.612607  4.992820e-01


In [133]:
a=np.array([[1,2],[3,4]])
b=np.array([[11,12],[13,14]])
print(a)
print(b.T)
# print(np.dot(a,b))
print(np.inner(a,b))


[[1 2]
 [3 4]]
[[11 13]
 [12 14]]
[[35 41]
 [81 95]]
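
`np.inner` builds the full matrix of all row-against-row products, of which only the diagonal is used for the angle calculation. A sketch of a row-wise alternative with `np.einsum` follows. The `np.clip` call is an added safeguard (an assumption, not part of the example above) against rounding that pushes a dot product slightly outside [-1, 1].

```python
import numpy as np

# Unit vectors, one per row
a = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])

# Full N x N product matrix, then keep the diagonal
full = np.diagonal(np.inner(a, b))
# Row-wise dot products only
rowwise = np.einsum('ij,ij->i', a, b)

angles = np.arccos(np.clip(rowwise, -1.0, 1.0))
print(angles)
```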


###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [134]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

6    3
3    2
1    2
0    2
5    1
dtype: int64
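
For the discretization half of the linked section, `pd.cut` can bin values before counting them. This is a minimal sketch. The bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 3, 3, 5, 6, 6, 6, 0])

# Bins are right-closed by default: (-1, 2], (2, 4], (4, 6]
binned = pd.cut(s, bins=[-1, 2, 4, 6], labels=['low', 'mid', 'high'])
counts = binned.value_counts().sort_index()
print(counts)
```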

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [135]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [136]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [137]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [138]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.

In [139]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h


Methods like replace and findall take regular expressions, too:

In [140]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object
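
Two other regular-expression methods under `str` are `contains` and `extract`. The sketch below applies them to a made-up Series similar to the ones above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# contains() returns a boolean mask; na=False treats missing values as no match
has_ba = s.str.contains('ba', case=False, na=False)

# extract() returns the first capture group of the first match, NaN if none
first_vowel = s.str.extract('([aeiou])', expand=False)

print(has_ba)
print(first_vowel)
```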

###Grouping
By “group by” we are referring to a process involving one or more of the following steps:  

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)

In [141]:
df = makefoobar()
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.269862,-1.259584
1,bar,one,-1.968966,-0.17259
2,foo,two,0.249694,0.856649
3,bar,three,0.350349,1.093341
4,foo,two,1.006223,-0.247256
5,bar,two,0.813212,-0.089682
6,foo,one,0.771463,0.922694
7,foo,three,-0.100108,-0.706631


Group the data and then apply the sum function to the resulting groups.

In [142]:
df.groupby('A').sum()

            C         D
A                      
bar -0.805405  0.831069
foo  1.657409 -0.434127


In [143]:
df.groupby(['A','B']).sum()

                  C         D
A   B                        
bar one   -1.968966 -0.172590
    three  0.350349  1.093341
    two    0.813212 -0.089682
foo one    0.501601 -0.336889
    three -0.100108 -0.706631
    two    1.255916  0.609393


In [144]:
df = makefoobar()
print(df)
cnts = {}
#first value is the group column value, second value is the members of the group
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()  
cnts

     A      B         C         D
0  foo    one -0.232744 -1.186992
1  bar    one -0.174555  0.601828
2  foo    two -1.521671 -1.018627
3  bar  three  0.852975 -0.859311
4  foo    two -0.645985  2.470934
5  bar    two -0.777122 -0.524432
6  foo    one -1.224340 -0.318513
7  foo  three  0.705881  1.077993


{'one': -0.5438797145720788,
 'three': 0.7794279457249833,
 'two': -0.9815926715579861}
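
The explicit loop over the groups can be collapsed into a single expression. The sketch below uses made-up values (an assumption, so the means are easy to check) and selects column C from the grouped object before aggregating.

```python
import pandas as pd

df = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'two'],
                   'C': [1.0, 3.0, 10.0, 7.0, 20.0]})

# split by B, apply mean to C, combine into a Series, then convert to a dict
cnts = df.groupby('B')['C'].mean().to_dict()
print(cnts)
```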

###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [145]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
print(tuples)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]


                     A         B
first second                    
bar   one     0.513028 -1.002175
      two     0.098271  0.845286
baz   one    -1.309240 -1.648563
      two     0.604084  0.774209


The stack function “compresses” a level in the DataFrame’s columns.

In [146]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    0.513028
               B   -1.002175
       two     A    0.098271
               B    0.845286
baz    one     A   -1.309240
               B   -1.648563
       two     A    0.604084
               B    0.774209
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [147]:
stacked.unstack()

                     A         B
first second                    
bar   one     0.513028 -1.002175
      two     0.098271  0.845286
baz   one    -1.309240 -1.648563
      two     0.604084  0.774209


In [148]:
stacked.unstack(1)

second        one       two
first                      
bar   A  0.513028  0.098271
      B -1.002175  0.845286
baz   A -1.309240  0.604084
      B -1.648563  0.774209


In [149]:
stacked.unstack(0)

first          bar       baz
second                      
one    A  0.513028 -1.309240
       B -1.002175 -1.648563
two    A  0.098271  0.604084
       B  0.845286  0.774209
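
The level passed to `unstack` can be given by name as well as by position. The sketch below, on a small made-up frame with the same index level names, checks that both forms agree and that `stack` followed by `unstack` restores the original values.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = df.stack()

by_name = stacked.unstack('second')  # level given by name
by_pos = stacked.unstack(1)          # same level given by position
round_trip = stacked.unstack()       # last level by default

print(by_name)
```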


##Categoricals
see the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [150]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to more meaningful names, then reorder the categories and simultaneously add the missing ones. Methods under Series.cat return a new Series by default, so each result is assigned back. (Older pandas versions also allowed in-place assignment to Series.cat.categories.)

In [151]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [152]:
# Sorting follows the order of the categories, not lexical order.
df.sort_values("grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


In [153]:
# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64
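
The sketch below rebuilds the grade column as an ordered categorical in one step and groups by it. The `observed=False` argument is passed explicitly to request all categories, and in recent pandas versions the empty ones are reported with a count of 0. The mapping from raw grades to names is taken from the example above.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
order = ["very bad", "bad", "medium", "good", "very good"]
df["grade"] = pd.Categorical(
    df["raw_grade"].map({'a': 'very good', 'b': 'good', 'e': 'very bad'}),
    categories=order, ordered=True)

# Count rows per category, including categories with no rows
counts = df.groupby("grade", observed=False).size()

# Ordered categoricals support comparisons against a category value
at_least_good = df[df["grade"] >= "good"]

print(counts)
```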

##Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [154]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-0.080959,-0.335984
1,one,B,foo,0.734002,-1.475041
2,two,C,foo,-0.200586,2.078062
3,three,A,bar,-0.648952,-1.705348
4,one,B,bar,-0.217767,0.319763
...,...,...,...,...,...
7,three,B,foo,1.093918,0.103627
8,one,C,foo,1.391357,0.960531
9,one,A,bar,-0.463493,0.929945
10,two,B,bar,0.562423,-0.178655


In [155]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])


C             bar       foo
A     B                    
one   A -0.463493 -0.080959
      B -0.217767  0.734002
      C -0.903001  1.391357
three A -0.648952       NaN
      B       NaN  1.093918
      C  1.170372       NaN
two   A       NaN -0.052444
      B  0.562423       NaN
      C       NaN -0.200586
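
`pivot_table` aggregates with the mean by default and leaves NaN where a combination has no rows. The sketch below, on small made-up data, shows the `aggfunc` and `fill_value` arguments that change both behaviours.

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'C': ['foo', 'bar', 'foo', 'foo', 'foo'],
                   'D': [1.0, 2.0, 3.0, 5.0, 7.0]})

# Default: mean of D for each (A, C) combination, NaN where there is no data
mean_table = pd.pivot_table(df, values='D', index='A', columns='C')

# Sum instead of mean, with empty cells filled with 0
sum_table = pd.pivot_table(df, values='D', index='A', columns='C',
                           aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```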


Reconstruct the samples presented in the pages at [Pandas Pivot Table Explained ](http://pbpython.com/pandas-pivot-table-explained.html) and [Generating Excel Reports from a Pandas Pivot Table ](http://pbpython.com/pandas-pivot-report.html).  

First load the data, convert the Status column to a pandas `category` and set the viewing order. Then build a pivot table that uses `Name` as the table index.

In [156]:
df = pd.read_excel("./data/sales-funnel.xlsx")
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])
print(df.head(4))

print(pd.pivot_table(df,index=["Name"]))


   Account                          Name           Rep       Manager      Product  Quantity  Price     Status
0   714466               Trantow-Barrows  Craig Booker  Debra Henley          CPU         1  30000  presented
1   714466               Trantow-Barrows  Craig Booker  Debra Henley     Software         1  10000  presented
2   714466               Trantow-Barrows  Craig Booker  Debra Henley  Maintenance         2   5000    pending
3   737550  Fritsch, Russel and Anderson  Craig Booker  Debra Henley          CPU         1  35000   declined
                              Account  Price  Quantity
Name                                                  
Barton LLC                     740150  35000  1.000000
Fritsch, Russel and Anderson   737550  35000  1.000000
Herman LLC                     141962  65000  2.000000
Jerde-Hilpert                  412290   5000  2.000000
Kassulke, Ondricka and Metz    307599   7000  3.000000
...                               ...    ...       ...
Koepp Ltd 

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the numpy convention of raising an error when you try to convert something to a bool. This happens in an `if` statement or when using the boolean operations `and`, `or`, or `not`.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
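
A short sketch of the gotcha and the usual ways around it: reduce the Series explicitly with `any()`, `all()` or `empty`, and use the element-wise operators `&`, `|` and `~` instead of `and`, `or` and `not`.

```python
import pandas as pd

s = pd.Series([True, False, True])

try:
    if s:  # ambiguous truth value: raises ValueError
        pass
except ValueError as err:
    message = str(err)
    print(message)

any_true = s.any()    # True if at least one element is True
all_true = s.all()    # True only if every element is True
is_empty = s.empty    # True if the Series has no elements
elementwise = s & ~s  # element-wise AND / NOT
```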


#Date and time

In [157]:
# print('0')
# TMY = pd.DataFrame([0,1],index=['1981-01-01T10:00:00.000000000+0200', '1981-01-01T11:00:00.000000000+0200'])
# print(TMY)
# print('1')
# print(type(TMY.index.to_datetime().values))
# print(TMY.index.to_datetime().values)
# print('2')
# print(type(TMY.index.astype(np.int64)))
# print(TMY.index.astype(np.int64) // 10**9)  #timestamp is unix time with nanoseconds
# print('3')
# print(type(pd.to_datetime(TMY.index.astype(np.int64))))
# print(pd.to_datetime(TMY.index.astype(np.int64)))  #timestamp is unix time with nanoseconds
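
The commented-out cell above experiments with converting a timezone-aware index to Unix time and back. A minimal runnable sketch of the same idea follows. It makes two assumptions: the offset is written as `+02:00`, and the epoch seconds are computed by subtracting the epoch and dividing by a one-second Timedelta rather than by casting to `int64`.

```python
import pandas as pd

stamps = ['1981-01-01T10:00:00+02:00', '1981-01-01T11:00:00+02:00']

# Parse the strings and convert them to UTC
idx = pd.to_datetime(stamps, utc=True)

# Seconds since the Unix epoch
epoch = pd.Timestamp('1970-01-01', tz='UTC')
unix_seconds = (idx - epoch) // pd.Timedelta(seconds=1)

# Convert the integer seconds back to timestamps
back = pd.to_datetime(unix_seconds, unit='s', utc=True)

print(unix_seconds)
print(back)
```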


## Python and [module versions, and dates](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb)

In [158]:
%load_ext version_information
%version_information pandas, numpy, scipy, matplotlib, pyradi

Software,Version
Python,2.7.8 32bit [MSC v.1500 32 bit (Intel)]
IPython,3.2.0
OS,Windows 7 6.1.7601 SP1
pandas,0.16.2
numpy,1.9.2
scipy,0.15.1
matplotlib,1.4.3
pyradi,0.2.1
Mon Nov 16 10:03:46 2015 South Africa Standard Time
