#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>  

<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>  


In [2]:
import numpy as np
import pandas as pd

###Set the display width for print output

In [3]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 150)

##Creating/Loading Data

###NaN in Pandas / Numpy

In [6]:
a = np.nan
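
NaN compares unequal to everything, itself included, so equality checks cannot detect it. A minimal sketch of the usual alternative, `pd.isnull`; the variable names and the sample Series are only illustrative:

```python
import numpy as np
import pandas as pd

a = np.nan

# NaN is not equal to anything, including itself.
nan_equals_itself = (a == a)

# Use the dedicated helper to test for missing values instead.
is_missing_scalar = pd.isnull(a)
is_missing_series = pd.isnull(pd.Series([1.0, np.nan, 3.0])).tolist()

print(nan_equals_itself)
print(is_missing_scalar)
print(is_missing_series)
```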

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.

In [125]:
dates = pd.date_range('20130101',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
print(df)


<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
                   A         B         C         D
2013-01-01 -1.865614 -1.510368 -0.753039  2.145441
2013-01-02  0.466112 -0.785366  2.276897  0.110845
2013-01-03  1.165091 -0.494944  0.482337 -0.500579
2013-01-04 -0.896564  1.990713 -1.313424 -1.433929
2013-01-05  1.119197 -0.068847 -1.797903 -0.245404
2013-01-06  0.683739  0.967133  0.576356  0.819930


Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [14]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2.dtypes)
df2


A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1,3,test,foo
1,1,2013-01-02,1,3,train,foo
2,1,2013-01-02,1,3,test,foo
3,1,2013-01-02,1,3,train,foo


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

In [126]:
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2013-01-01,-1.865614,-1.510368,-0.753039,2.145441
1,2013-01-02,0.466112,-0.785366,2.276897,0.110845
2,2013-01-03,1.165091,-0.494944,0.482337,-0.500579
3,2013-01-04,-0.896564,1.990713,-1.313424,-1.433929
4,2013-01-05,1.119197,-0.068847,-1.797903,-0.245404
5,2013-01-06,0.683739,0.967133,0.576356,0.81993
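
The extra `Unnamed: 0` column in the output above appears because `read_csv` is not told that the first column held the index. A possible way to restore it is sketched below, using a small stand-in frame and an in-memory buffer instead of `foo.csv` (both are illustrative assumptions, not part of the original notebook):

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
original = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                        index=dates, columns=list('AB'))

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index; parse_dates=True
# converts that index back to datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
```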


In [127]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Unnamed: 0,A,B,C,D
2013-01-01,-1.865614,-1.510368,-0.753039,2.145441
2013-01-02,0.466112,-0.785366,2.276897,0.110845
2013-01-03,1.165091,-0.494944,0.482337,-0.500579
2013-01-04,-0.896564,1.990713,-1.313424,-1.433929
2013-01-05,1.119197,-0.068847,-1.797903,-0.245404
2013-01-06,0.683739,0.967133,0.576356,0.81993


###Get the Number of Rows in a DataFrame

In [48]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nshape[0]={}'.format(df.shape[0]))
print('\ndf.count()[0]={}'.format(df.count()[0]))
print('\ndf.count()[1]={}'.format(df.count()[1]))

   A   B
0  1 NaN
1  3   4

shape[0]=2

df.count()[0]=2

df.count()[1]=1
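
As the output shows, `shape[0]` counts every row while `count()` skips NaN. A short sketch of the same distinction using `len` and label-based access to the counts (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))

n_rows = len(df)                  # total rows, NaN included
n_rows_shape = df.shape[0]        # same value via the shape tuple
non_null_per_column = df.count()  # non-NaN values per column

print(n_rows, n_rows_shape)
print(non_null_per_column['A'], non_null_per_column['B'])
```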


##Manipulating DataFrames

###Adding a row to a DataFrame

In [49]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
# print(df)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8
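
The result above keeps the original row labels, so 0 and 1 each appear twice. If a fresh 0..n-1 index is wanted, one option is `pd.concat` with `ignore_index=True`; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# ignore_index=True discards the original row labels and numbers
# the result 0..n-1, avoiding the duplicated 0 and 1 labels.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```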


In [17]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df.index)
df.loc[df.shape[0]] = ['a','b']
df.loc[df.shape[0]] = ['d','e']

print(df)


Int64Index([0, 1], dtype='int64')
   A  B
0  1  2
1  3  4
2  a  b
3  d  e


In [18]:
df.head(2)

Unnamed: 0,A,B
0,1,2
1,3,4


In [19]:
df.tail(2)

Unnamed: 0,A,B
2,a,b
3,d,e


In [20]:
df.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [21]:
df.columns

Index([u'A', u'B'], dtype='object')

In [22]:
df.values

array([[1L, 2L],
       [3L, 4L],
       ['a', 'b'],
       ['d', 'e']], dtype=object)

In [23]:
df.T

Unnamed: 0,0,1,2,3
A,1,3,a,d
B,2,4,b,e


###Sorting

In [27]:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.103425,0.033224,-0.23525,-1.276837
2013-01-02,1.436771,0.604065,0.934772,1.398259
2013-01-03,-0.918774,-1.100444,0.173383,2.395457
2013-01-04,-0.687313,0.997013,2.605382,0.457795
2013-01-05,1.401041,0.133755,-1.36922,-0.445369
2013-01-06,0.165939,1.153403,-0.186565,0.498889


Sort by column index (axis=1)

In [28]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-1.276837,-0.23525,0.033224,1.103425
2013-01-02,1.398259,0.934772,0.604065,1.436771
2013-01-03,2.395457,0.173383,-1.100444,-0.918774
2013-01-04,0.457795,2.605382,0.997013,-0.687313
2013-01-05,-0.445369,-1.36922,0.133755,1.401041
2013-01-06,0.498889,-0.186565,1.153403,0.165939


Sort all rows by value in column B

In [29]:
df.sort_values(by='B')  # spelled df.sort(columns='B') in pandas before 0.17

Unnamed: 0,A,B,C,D
2013-01-03,-0.918774,-1.100444,0.173383,2.395457
2013-01-01,1.103425,0.033224,-0.23525,-1.276837
2013-01-05,1.401041,0.133755,-1.36922,-0.445369
2013-01-02,1.436771,0.604065,0.934772,1.398259
2013-01-04,-0.687313,0.997013,2.605382,0.457795
2013-01-06,0.165939,1.153403,-0.186565,0.498889
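
Sorting by more than one column with mixed directions follows the same pattern. The sketch below assumes pandas 0.17 or later, where `sort_values` is available, and uses a small made-up frame so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0.5, 0.1, 0.3, 0.9]})

# Sort by A ascending, then break ties by B descending.
ordered = df.sort_values(by=['A', 'B'], ascending=[True, False])
print(ordered)
```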


###Selecting portions

While standard Python / NumPy expressions for selecting and setting are intuitive and handy for interactive work, the pandas documentation recommends the optimized data access methods .at, .iat, .loc and .iloc for production code. (The .ix indexer listed in the 0.15.2 documentation was deprecated in pandas 0.20 and removed in 1.0.)

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)
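
A quick side-by-side of the four accessors may help; this is a sketch on a small made-up frame with string row labels, chosen so that label and position cannot be confused:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]},
                  index=['x', 'y', 'z'])

by_label = df.loc['y', 'B']        # row label 'y', column label 'B'
by_position = df.iloc[1, 1]        # second row, second column
scalar_label = df.at['y', 'B']     # fast scalar access by label
scalar_position = df.iat[1, 1]     # fast scalar access by position

print(by_label, by_position, scalar_label, scalar_position)
```

`.loc` and `.iloc` also accept slices and lists, whereas `.at` and `.iat` only address a single cell.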

Selecting a single column, which yields a Series, equivalent to df.A

In [30]:
df['A']

2013-01-01    1.103425
2013-01-02    1.436771
2013-01-03   -0.918774
2013-01-04   -0.687313
2013-01-05    1.401041
2013-01-06    0.165939
Freq: D, Name: A, dtype: float64

In [35]:
np.asarray(df['A'])

array([ 1.10342469,  1.436771  , -0.91877426, -0.68731321,  1.40104074,
        0.1659392 ])

Selecting via [], which slices the rows.

In [38]:
df[2:4]

Unnamed: 0,A,B,C,D
2013-01-03,-0.918774,-1.100444,0.173383,2.395457
2013-01-04,-0.687313,0.997013,2.605382,0.457795


In [37]:
df['2013-01-01':'2013-01-02']

Unnamed: 0,A,B,C,D
2013-01-01,1.103425,0.033224,-0.23525,-1.276837
2013-01-02,1.436771,0.604065,0.934772,1.398259


###[Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

In [39]:
df.loc[dates[0]]

A    1.103425
B    0.033224
C   -0.235250
D   -1.276837
Name: 2013-01-01 00:00:00, dtype: float64

In [40]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,1.103425,0.033224
2013-01-02,1.436771,0.604065
2013-01-03,-0.918774,-1.100444
2013-01-04,-0.687313,0.997013
2013-01-05,1.401041,0.133755
2013-01-06,0.165939,1.153403


In [41]:
#endpoints included
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,1.436771,0.604065
2013-01-03,-0.918774,-1.100444
2013-01-04,-0.687313,0.997013


In [42]:
df.loc['20130102',['A','B']]

A    1.436771
B    0.604065
Name: 2013-01-02 00:00:00, dtype: float64

###[Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

In [43]:
df.iloc[3]

A   -0.687313
B    0.997013
C    2.605382
D    0.457795
Name: 2013-01-04 00:00:00, dtype: float64

In [44]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,-0.687313,0.997013
2013-01-05,1.401041,0.133755


In [45]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,1.436771,0.934772
2013-01-03,-0.918774,0.173383
2013-01-05,1.401041,-1.36922


In [46]:
#slicing rows
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,1.436771,0.604065,0.934772,1.398259
2013-01-03,-0.918774,-1.100444,0.173383,2.395457


In [47]:
#slicing columns
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,0.033224,-0.23525
2013-01-02,0.604065,0.934772
2013-01-03,-1.100444,0.173383
2013-01-04,0.997013,2.605382
2013-01-05,0.133755,-1.36922
2013-01-06,1.153403,-0.186565


In [48]:
df.iloc[1,1]

0.60406479592945261

###Boolean indexing

In [50]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-01,1.103425,0.033224,-0.23525,-1.276837
2013-01-02,1.436771,0.604065,0.934772,1.398259
2013-01-05,1.401041,0.133755,-1.36922,-0.445369
2013-01-06,0.165939,1.153403,-0.186565,0.498889


In [51]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,1.103425,0.033224,,
2013-01-02,1.436771,0.604065,0.934772,1.398259
2013-01-03,,,0.173383,2.395457
2013-01-04,,0.997013,2.605382,0.457795
2013-01-05,1.401041,0.133755,,
2013-01-06,0.165939,1.153403,,0.498889


In [53]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01  1.103425  0.033224 -0.235250 -1.276837    one
2013-01-02  1.436771  0.604065  0.934772  1.398259    one
2013-01-03 -0.918774 -1.100444  0.173383  2.395457    two
2013-01-04 -0.687313  0.997013  2.605382  0.457795  three
2013-01-05  1.401041  0.133755 -1.369220 -0.445369   four
2013-01-06  0.165939  1.153403 -0.186565  0.498889  three


Unnamed: 0,A,B,C,D,E
2013-01-03,-0.918774,-1.100444,0.173383,2.395457,two
2013-01-05,1.401041,0.133755,-1.36922,-0.445369,four
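
Several conditions can be combined in one mask. The sketch below uses a made-up frame; the element-wise operators are `&`, `|` and `~`, and each comparison needs its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0, -4.0],
                   'B': [0.5, 0.6, -0.7, -0.8]})

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators.
both_positive = df[(df['A'] > 0) & (df['B'] > 0)]
either_positive = df[(df['A'] > 0) | (df['B'] > 0)]
not_positive_a = df[~(df['A'] > 0)]

print(both_positive)
print(either_positive)
print(not_positive_a)
```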


###Setting data

Setting a new column automatically aligns the data by the indexes

In [57]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,1.103425,0.033224,-0.23525,-1.276837,one,
2013-01-02,1.436771,0.604065,0.934772,1.398259,one,1.0
2013-01-03,-0.918774,-1.100444,0.173383,2.395457,two,2.0
2013-01-04,-0.687313,0.997013,2.605382,0.457795,three,3.0
2013-01-05,1.401041,0.133755,-1.36922,-0.445369,four,4.0
2013-01-06,0.165939,1.153403,-0.186565,0.498889,three,5.0
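
The alignment is by index label, not by position, which is why the first row above has no value in F. A smaller sketch of the same behaviour, with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
s = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched on index labels: 'a' has no match in s and
# becomes NaN, while the extra label 'd' is dropped.
df['F'] = s
print(df)
```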


In [59]:
# Setting values by label
df['F'] = s1  # add the aligned column to df as well, as the output below shows
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.033224,-0.23525,-1.276837,
2013-01-02,1.436771,0.604065,0.934772,1.398259,1.0
2013-01-03,-0.918774,-1.100444,0.173383,2.395457,2.0
2013-01-04,-0.687313,0.997013,2.605382,0.457795,3.0
2013-01-05,1.401041,0.133755,-1.36922,-0.445369,4.0
2013-01-06,0.165939,1.153403,-0.186565,0.498889,5.0


In [62]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,7.0,-0.23525,-1.276837,
2013-01-02,1.436771,0.604065,0.934772,1.398259,1.0
2013-01-03,-0.918774,-1.100444,0.173383,2.395457,2.0
2013-01-04,-0.687313,0.997013,2.605382,0.457795,3.0
2013-01-05,1.401041,0.133755,-1.36922,-0.445369,4.0
2013-01-06,0.165939,1.153403,-0.186565,0.498889,5.0


In [65]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,7.0,-0.23525,5,
2013-01-02,1.436771,0.604065,0.934772,5,1.0
2013-01-03,-0.918774,-1.100444,0.173383,5,2.0
2013-01-04,-0.687313,0.997013,2.605382,5,3.0
2013-01-05,1.401041,0.133755,-1.36922,5,4.0
2013-01-06,0.165939,1.153403,-0.186565,5,5.0


In [66]:
# A where operation with setting.
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,-7.0,-0.23525,-5,
2013-01-02,-1.436771,-0.604065,-0.934772,-5,-1.0
2013-01-03,-0.918774,-1.100444,-0.173383,-5,-2.0
2013-01-04,-0.687313,-0.997013,-2.605382,-5,-3.0
2013-01-05,-1.401041,-0.133755,-1.36922,-5,-4.0
2013-01-06,-0.165939,-1.153403,-0.186565,-5,-5.0


##Missing data

pandas primarily uses the value np.nan to represent missing data. By default it is excluded from computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data).

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [68]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,7.0,-0.23525,5,,1.0
2013-01-02,1.436771,0.604065,0.934772,5,1.0,1.0
2013-01-03,-0.918774,-1.100444,0.173383,5,2.0,
2013-01-04,-0.687313,0.997013,2.605382,5,3.0,


Drop any rows that have missing data

In [69]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,1.436771,0.604065,0.934772,5,1,1


Filling missing data

In [70]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,7.0,-0.23525,5,5,1
2013-01-02,1.436771,0.604065,0.934772,5,1,1
2013-01-03,-0.918774,-1.100444,0.173383,5,2,5
2013-01-04,-0.687313,0.997013,2.605382,5,3,5


To get the boolean mask where values are nan

In [71]:
pd.isnull(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
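
Both `dropna` and `fillna` can be narrowed to particular columns. A sketch on a made-up frame; the replacement values are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0],
                   'C': ['x', 'y', None]})

# Drop rows only when a specific column is missing.
has_a = df.dropna(subset=['A'])

# Fill each column with its own replacement value.
filled = df.fillna({'A': 0.0, 'B': df['B'].mean(), 'C': 'unknown'})

print(has_a)
print(filled)
```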


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [73]:
df.mean()

A    0.232944
B    1.464632
C    0.320417
D    5.000000
F    3.000000
dtype: float64

In [74]:
#along the other axis
df.mean(1)

2013-01-01    2.941187
2013-01-02    1.795122
2013-01-03    1.030833
2013-01-04    2.183016
2013-01-05    1.833115
2013-01-06    2.226555
Freq: D, dtype: float64
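
The exclusion of missing data can be switched off with `skipna=False`, in which case NaN propagates into the result. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_skipping_nan = s.mean()           # NaN ignored: (1 + 3) / 2
mean_with_nan = s.mean(skipna=False)   # NaN propagates into the result

print(mean_skipping_nan, mean_with_nan)
```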

Applying functions to the data

In [75]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,7.0,-0.23525,5,
2013-01-02,1.436771,7.604065,0.699522,10,1.0
2013-01-03,0.517997,6.503621,0.872904,15,3.0
2013-01-04,-0.169316,7.500634,3.478287,20,6.0
2013-01-05,1.231724,7.634389,2.109067,25,10.0
2013-01-06,1.397663,8.787791,1.922502,30,15.0


In [76]:
df.apply(lambda x: x.max() - x.min())

A    2.355545
B    8.100444
C    3.974603
D    0.000000
F    4.000000
dtype: float64

###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [78]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

2    5
6    1
5    1
4    1
3    1
0    1
dtype: int64
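
For relative frequencies or binned counts, `value_counts(normalize=True)` and `pd.cut` are two options. The sketch below uses fixed values instead of random ones so the result is reproducible; the bin edges are an arbitrary choice:

```python
import pandas as pd

s = pd.Series([2, 2, 2, 5, 6, 0])

counts = s.value_counts()                 # absolute frequencies
shares = s.value_counts(normalize=True)   # relative frequencies

# pd.cut assigns each value to a right-closed interval:
# (-1, 2], (2, 4], (4, 6]. Empty bins are reported with a count of 0.
binned = pd.cut(s, bins=[-1, 2, 4, 6]).value_counts().sort_index()

print(counts)
print(shares)
print(binned)
```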

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [80]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [82]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [85]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [89]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation on .str to index each string directly by position. Indexing past the end of a string yields NaN.

In [84]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h
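
Newer pandas releases (0.16.1 and later, which is an assumption about the reader's version) offer `expand=True` on `str.split`, which produces the same DataFrame without `apply`. A sketch:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# expand=True returns one column per split piece; a missing
# input value yields a row of missing values.
parts = s2.str.split('_', expand=True)
print(parts)
```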


Methods like replace and findall take regular expressions, too:

In [87]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
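# regex=True is needed from pandas 2.0 on, where str.replace defaults to a literal match
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)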

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

###Merge and concat

In [90]:
# Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)

          0         1         2         3
0  1.322174 -0.474578  0.505505 -0.706056
1  0.545372  0.001437 -0.577440 -1.241909
2 -0.398184  0.824804  0.548485  1.274906
3 -0.092182  0.466737  1.418709  0.824420
4 -0.226422 -1.444101 -0.409759 -1.178153
5  0.032576  1.665136 -0.689957  1.673670
6  0.389093  0.293846  0.921651 -0.056348
7  0.762941 -1.855544  0.955199  1.237590
8  1.736763  0.199994 -0.476021  0.244189
9 -0.398968 -1.164893  0.334531  0.117905


Unnamed: 0,0,1,2,3
0,1.322174,-0.474578,0.505505,-0.706056
1,0.545372,0.001437,-0.57744,-1.241909
2,-0.398184,0.824804,0.548485,1.274906
3,-0.092182,0.466737,1.418709,0.82442
4,-0.226422,-1.444101,-0.409759,-1.178153
5,0.032576,1.665136,-0.689957,1.67367
6,0.389093,0.293846,0.921651,-0.056348
7,0.762941,-1.855544,0.955199,1.23759
8,1.736763,0.199994,-0.476021,0.244189
9,-0.398968,-1.164893,0.334531,0.117905


SQL-style merges. See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join) section.

In [91]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
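
The example above is an inner join on a duplicated key, which is why it yields every pairing. The `how` parameter selects other join types; a sketch with made-up frames whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

inner = pd.merge(left, right, on='key')                  # keys in both frames
left_join = pd.merge(left, right, on='key', how='left')  # every key from left
outer = pd.merge(left, right, on='key', how='outer')     # union of all keys

print(inner)
print(left_join)
print(outer)
```

Rows without a partner on the other side get NaN in the columns that come from that side.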


Append rows to a DataFrame. See the [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation) section.

In [92]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
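# DataFrame.append was removed in pandas 2.0; concatenate the row as a one-row frame
pd.concat([df, s.to_frame().T], ignore_index=True)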

Unnamed: 0,A,B,C,D
0,1.294486,-0.889948,-1.881153,-0.433599
1,0.278342,-0.216489,-0.615418,1.718508
2,0.442856,1.078641,-0.357037,0.903483
3,-0.497698,0.326467,-1.618047,-0.71472
4,0.152452,0.823727,-0.675231,1.324093
5,0.753909,0.15814,-1.101824,0.967674
6,1.210821,1.721259,-1.693436,-0.11693
7,-0.003423,0.075932,-2.404601,-0.757169
8,-0.497698,0.326467,-1.618047,-0.71472


###Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)

In [96]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.712806,-1.172065
1,bar,one,1.147885,1.776428
2,foo,two,-1.822217,0.477578
3,bar,three,-1.549544,0.320973
4,foo,two,-0.777359,-1.523429
5,bar,two,-0.164093,-0.89494
6,foo,one,1.957744,0.724117
7,foo,three,1.537824,0.583602


Grouping and then applying the sum function to the resulting groups.

In [97]:
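# Select the numeric columns explicitly; pandas 2.0+ would otherwise also concatenate the strings in B
df.groupby('A')[['C', 'D']].sum()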

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,-0.565751,1.202461
foo,0.183184,-0.910196


In [98]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1.147885,1.776428
bar,three,-1.549544,0.320973
bar,two,-0.164093,-0.89494
foo,one,1.244937,-0.447947
foo,three,1.537824,0.583602
foo,two,-2.599576,-1.045851


###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [110]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.952207,0.289538
bar,two,-0.590571,-0.181043
baz,one,0.21131,-0.417464
baz,two,0.75775,1.370016


The stack function “compresses” a level in the DataFrame’s columns.

In [111]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    0.952207
               B    0.289538
       two     A   -0.590571
               B   -0.181043
baz    one     A    0.211310
               B   -0.417464
       two     A    0.757750
               B    1.370016
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [112]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.952207,0.289538
bar,two,-0.590571,-0.181043
baz,one,0.21131,-0.417464
baz,two,0.75775,1.370016


In [113]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,0.952207,-0.590571
bar,B,0.289538,-0.181043
baz,A,0.21131,0.75775
baz,B,-0.417464,1.370016


In [114]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.952207,0.21131
one,B,0.289538,-0.417464
two,A,-0.590571,0.75775
two,B,-0.181043,1.370016
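
Levels can be unstacked by name as well as by number, which tends to read more clearly. A sketch with fixed values in place of random ones, reusing the level names from above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')],
    names=['first', 'second'])
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [5.0, 6.0, 7.0, 8.0]},
                  index=index)

stacked = df.stack()                  # columns become the innermost index level
by_name = stacked.unstack('second')   # move the level named 'second' to the columns
round_trip = stacked.unstack()        # default: innermost level back to columns

print(by_name)
print(round_trip)
```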


###Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [116]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,0.589578,-2.154247
1,one,B,foo,1.171831,0.524318
2,two,C,foo,0.646588,-0.332976
3,three,A,bar,-0.210874,0.272061
4,one,B,bar,0.164178,-0.498959
5,one,C,bar,-1.124013,0.999033
6,two,A,foo,1.595621,0.049699
7,three,B,foo,-1.065926,-0.842953
8,one,C,foo,1.90886,-0.487834
9,one,A,bar,0.538731,-0.282055


In [117]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.538731,0.589578
one,B,0.164178,1.171831
one,C,-1.124013,1.90886
three,A,-0.210874,
three,B,,-1.065926
three,C,-0.420999,
two,A,,1.595621
two,B,-0.745748,
two,C,,0.646588


##Categoricals
See the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [119]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names with rename_categories, then reorder them and add the missing categories with set_categories. Methods under Series.cat return a new Series by default, so the results are assigned back. (Assigning directly to Series.cat.categories, as older pandas allowed, no longer works in pandas 2.0 and later.)

In [120]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]

In [121]:
# Sorting is per order in the categories, not lexical order.
df.sort_values(by="grade")  # spelled df.sort("grade") in pandas before 0.17

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


In [122]:
# Grouping by a categorical column shows also empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64
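
When the category order carries meaning, the categorical can be declared ordered, which enables comparisons and `min`/`max` in addition to sorting. A sketch that builds such a Series directly; the sample values are made up:

```python
import pandas as pd

grades = pd.Series(pd.Categorical(
    ['good', 'very bad', 'very good', 'good'],
    categories=['very bad', 'bad', 'medium', 'good', 'very good'],
    ordered=True))

sorted_grades = grades.sort_values().tolist()   # category order, not alphabetical
best = grades.max()                             # highest category present
at_least_good = (grades >= 'good').tolist()     # comparison uses category order

print(sorted_grades)
print(best)
print(at_least_good)
```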

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operators and, or, and not.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
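
A short sketch of the error and of the explicit reductions that avoid it; the sample Series is made up:

```python
import pandas as pd

s = pd.Series([True, False, True])

try:
    if s:                 # ambiguous: which element should decide?
        pass
    raised = False
except ValueError:
    raised = True

any_true = s.any()        # True if at least one element is True
all_true = s.all()        # True only if every element is True
is_empty = s.empty        # True if the Series has no elements

# Element-wise logic uses &, | and ~ rather than and, or, not.
negated = (~s).tolist()

print(raised, any_true, all_true, is_empty, negated)
```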
