#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  
<http://pbpython.com/excel-pandas-comp-2.html>  

<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>   

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>   

|Create pandas data structures| |
|--|--|
|s = pd.Series(data, index) |Create a Series.|
|df = pd.DataFrame(data, index, columns) |Create a DataFrame.|
|p = pd.Panel(data, items, major_axis, minor_axis)|Create a Panel (removed in pandas 0.25; use a MultiIndex DataFrame instead).|


|	DataFrame Commands	|		|
|--|--|
|	df[col]	|	Select column.	|
|	df.loc[label]	|	Select row by label.	|
|	df.index	|	Return DataFrame index.	|
|	df.drop()	|	Delete given row or column. Pass axis=1 for columns.	|
|	df1 = df1.reindex_like(df2)	|	Reindex df1 with index of df2.	|
|	df.reset_index()	|	Reset index, putting old index in column named index.	|
|	df.reindex()	|	Change DataFrame index; rows for new index labels are filled with NaN.	|
|	df.head(n)	|	Show first n rows.	|
|	df.tail(n)	|	Show last n rows.	|
|	df.sort_index()	|	Sort by row index.	|
|	df.sort_index(axis=1)	|	Sort by column labels.	|
|	df.sort_values(by=col)	|	Sort rows by the values in a column.	|
|	df.pivot(index,column,values)	|	Pivot DataFrame, using new conditions.	|
|	df.T	|	Transpose DataFrame.	|
|	df.stack()	|	Change lowest level of column labels into innermost row index.	|
|	df.unstack()	|	Change innermost row index into lowest level of column labels.	|
|	df.applymap()	|	Apply function to every element in DataFrame.	|
|	df.apply()	|	Apply function along a given axis	|
|	df.dropna()	|	Drops rows where any data is missing.	|
|	df.count()	|	Return a Series with the number of non-NaN values in every column.	|
|	df.min()	|	Return minimum of every column.	|
|	df.max()	|	Return maximum of every column.	|
|	df.describe()	|	Generate various summary statistics for every column.	|
|	pd.concat()	|	Concatenate DataFrame or Series objects along an axis.	|

|	Groupby	|		|
|--|--|
|	groupby()	|	Split DataFrame by columns. Creates a GroupBy object (gb).	|
|	gb.agg()	|	Apply function (single or list) to a GroupBy object.	|
|	gb.transform()	|	Applies function and returns object with same index as one being grouped.	|
|	gb.filter()	|	Filter GroupBy object by a given function.	|
|	gb.groups	|	Return dict whose keys are the unique groups, and values are axis labels belonging to each group.	|

|	I/O	|		|
|--|--|
|	df.to_csv('foo.csv')	|	Save to CSV.	|
|	pd.read_csv('foo.csv')	|	Read CSV into DataFrame.	|
|	df.to_excel('foo.xlsx', sheet_name='Sheet1')	|	Save to Excel.	|
|	pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])	|	Read Excel into DataFrame.	|

	


###Set the display width for print output

There are quite a few display options to configure. If you're using IPython, use tab completion to find the [full set](http://pandas.pydata.org/pandas-docs/stable/basics.html#working-with-package-options) of display options:

    pd.options.display.<tab>

<http://stackoverflow.com/questions/21249206/how-to-configure-display-output-in-ipython-pandas>

In [2]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 150)
pd.set_option('display.max_colwidth', 150)

In [3]:
def makeDateRand():
    dates = pd.date_range('20130101',periods=6)
    df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))    
    return df

In [4]:
def makefoobar():
    return  pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                          'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                          'C' : np.random.randn(8),
                          'D' : np.random.randn(8)})

In [5]:
def makegridDF():
    return  pd.DataFrame({'A' : [1,2,3],
                          'B' : [4,5,6],
                          'C' : [7,8,9],
                          'D' : [10,11,12]})



##Creating/Loading Data

###NaN in Pandas / Numpy

In [6]:
a = np.nan
print(a)
print(pd.isnull(a))

nan
True


In [7]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print(pd.isnull(a))
#change all NaN to some other value
df.B = df.B.fillna('**')
print(df)

   A   B
0  1 NaN
1  3   4
True
   A   B
0  1  **
1  3   4


Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.

In [8]:
df = makeDateRand()
print(df)

                   A         B         C         D
2013-01-01  0.198261 -0.048306  0.108867  0.558307
2013-01-02 -0.030974 -0.575585  0.921757  0.785422
2013-01-03  1.881196  0.293032  0.861185  0.204870
2013-01-04 -0.088330 -0.874905  0.063219  0.770517
2013-01-05  0.639920  1.059836 -0.004797  0.374886
2013-01-06  2.256758 -0.665419  0.464992 -0.034355


Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [9]:
df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                      'E' : pd.Categorical(["test","train","test","train"]),
                      'F' : 'foo' })
print(df2.dtypes)
df2


A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1,3,test,foo
1,1,2013-01-02,1,3,train,foo
2,1,2013-01-02,1,3,test,foo
3,1,2013-01-02,1,3,train,foo


In [10]:
#create dataframe from a string
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
data = """UsrId JobNos
1 4
1 5"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
df

Unnamed: 0,UsrId,JobNos
0,1,4
1,1,5


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

In [11]:
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,UsrId,JobNos
0,0,1,4
1,1,1,5
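
The extra `Unnamed: 0` column in the round trip above is the index that `to_csv` wrote out by default. A minimal sketch of two ways to avoid it, using an in-memory buffer instead of a file (the frame below is a made-up stand-in for the one in the cell above):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'UsrId': [1, 1], 'JobNos': [4, 5]})

# Option 1: do not write the index at all.
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the index, then tell read_csv which column holds it.
buf = StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

print(no_index.columns.tolist())    # ['UsrId', 'JobNos']
print(with_index.columns.tolist())  # ['UsrId', 'JobNos']
```

Option 1 is usually enough when the index is just the default 0..n-1 counter; option 2 matters when the index carries information, such as dates.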


In [12]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Unnamed: 0,UsrId,JobNos
0,1,4
1,1,5


###Get the Number of Rows in a DataFrame

In [13]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nlen(df)={}'.format(len(df)))
print('\nshape[0]={}'.format(df.shape[0]))
print('\ndf.count()[0]={}'.format(df.count()[0]))
print('\ndf.count()[1]={}'.format(df.count()[1]))
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

   A   B
0  1 NaN
1  3   4

len(df)=2

shape[0]=2

df.count()[0]=2

df.count()[1]=1

Nans = 
       A      B
0  False   True
1  False  False


##Manipulating DataFrames

###Dropping and adding rows in a DataFrame

When using drop(), note the axis direction:
- to drop a row, use axis=0 (the default)
- to drop a column, use axis=1

In [14]:
#drop a column (axis=1), then drop the row labeled 0
df = makegridDF()
print(df.drop('A',axis=1))
print(df.drop(0))

   B  C   D
0  4  7  10
1  5  8  11
2  6  9  12
   A  B  C   D
1  2  5  8  11
2  3  6  9  12


In [15]:
#drop a row
df = makegridDF()
print(df.drop(1))
print(df.drop(0,axis=0))

   A  B  C   D
0  1  4  7  10
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
2  3  6  9  12


In [16]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
# print(df)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8


In [17]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df.index)
df.loc[df.shape[0]] = ['a','b']
df.loc[df.shape[0]] = ['d','e']

print(df)


Int64Index([0, 1], dtype='int64')
   A  B
0  1  2
1  3  4
2  a  b
3  d  e


In [18]:
#set / change the index name
df.index.name = 'DateIndex'
df

Unnamed: 0_level_0,A,B
DateIndex,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1,2
1,3,4
2,a,b
3,d,e


In [19]:
#set index to one of the columns
df.index = df.A
print(df)

   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e


In [20]:
df2 = pd.concat([df,df[2:4]])
df2

Unnamed: 0_level_0,A,B
A,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,2
3,3,4
a,a,b
d,d,e
a,a,b
d,d,e


In [21]:
#Find duplicate rows
df2['isdup'] = df2.duplicated(subset=['A','B'])
print(df2)
# df = df2[df.duplicated(subset=1)] # creates a new dataframe with the repeated rows
# len(df) # number of duplicates

   A  B  isdup
A             
1  1  2  False
3  3  4  False
a  a  b  False
d  d  e  False
a  a  b   True
d  d  e   True


In [22]:
# creates a new dataframe with the repeated rows
dx = df2[df2.duplicated(subset=['A'])] 
dx

Unnamed: 0_level_0,A,B,isdup
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,a,b,True
d,d,e,True


In [23]:
# Drop rows with duplicate values in column 'A', keeping the last occurrence
# (keep='last' replaces the older take_last=True argument)
df2.drop_duplicates(subset=['A'], keep='last', inplace=True)
df2

Unnamed: 0_level_0,A,B,isdup
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,2,False
3,3,4,False
a,a,b,True
d,d,e,True
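
Which of the duplicated rows survives depends on the `keep` argument of `drop_duplicates`. A small sketch of the three settings, on a made-up frame rather than the one above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x', 'z'], 'B': [1, 2, 3, 4]})

first = df.drop_duplicates(subset=['A'], keep='first')  # keep the first 'x'
last = df.drop_duplicates(subset=['A'], keep='last')    # keep the last 'x'
none = df.drop_duplicates(subset=['A'], keep=False)     # drop every 'x'

print(first.index.tolist())  # [0, 1, 3]
print(last.index.tolist())   # [1, 2, 3]
print(none.index.tolist())   # [1, 3]
```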


In [24]:
df.head(2)

Unnamed: 0_level_0,A,B
A,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,2
3,3,4


In [25]:
df.tail(2)

Unnamed: 0_level_0,A,B
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,a,b
d,d,e


In [26]:
df.index

Index([1, 3, u'a', u'd'], dtype='object')

In [27]:
df.columns

Index([u'A', u'B'], dtype='object')

In [28]:
df.values

array([[1L, 2L],
       [3L, 4L],
       ['a', 'b'],
       ['d', 'e']], dtype=object)

In [29]:
df.T

A,1,3,a,d
A,1,3,a,d
B,2,4,b,e


###Adding a column to a dataframe

In [30]:
print(df)
print(df.A)
print(df['A'])
print(df[['A','B']])

   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e
A
1    1
3    3
a    a
d    d
Name: A, dtype: object
A
1    1
3    3
a    a
d    d
Name: A, dtype: object
   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e


In [31]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01 -0.489807  0.768997 -0.699720 -0.797284 -0.420530
2013-01-02  0.059630 -0.114024 -0.595211  1.214019 -0.649605
2013-01-03 -1.117472  0.453205  0.613234  0.900757 -0.051033
2013-01-04  0.664798  0.783614  1.210832 -0.484715  2.659244
2013-01-05 -0.306571  1.061228 -0.315246  1.483469  0.439411
2013-01-06 -0.508798  0.009179  1.041080  0.889748  0.541461


In [32]:
#Sum each of the columns A, B and C
df['A'].sum(), df['B'].sum(), df['C'].sum()

(-1.6982201962042871, 2.9621983787053594, 1.2549694191645733)

In [33]:
sum_row=df[["A","B",'Total']].sum()
sum_row

A       -1.698220
B        2.962198
Total    2.518948
dtype: float64

We need to convert the Series to a DataFrame and transpose it so that it is easier to concatenate onto our existing data. The `T` attribute transposes the frame, turning the single column of sums into a single row.

In [34]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Unnamed: 0,A,B,Total
0,-1.69822,2.962198,2.518948


The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this. The trick is to pass all of our columns and let pandas fill in the missing values with NaN.

In [35]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,A,B,C,D,Total
0,-1.69822,2.962198,,,2.518948


Now append the totals to the end of the DataFrame.

In [36]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, df_sum], ignore_index=True)
df.tail()

Unnamed: 0,A,B,C,D,Total
2,-1.117472,0.453205,0.613234,0.900757,-0.051033
3,0.664798,0.783614,1.210832,-0.484715,2.659244
4,-0.306571,1.061228,-0.315246,1.483469,0.439411
5,-0.508798,0.009179,1.04108,0.889748,0.541461
6,-1.69822,2.962198,,,2.518948
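
The three steps above (sum, transpose, reindex) can also be chained. This is only a sketch under the assumption of a small made-up frame; the column names mirror the ones used above, the values do not:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [5.0, 6.0]})
df['Total'] = df['A'] + df['B'] + df['C']

# Sum only the columns that should get a total; reindex fills the rest with NaN.
totals = df[['A', 'B', 'Total']].sum().reindex(df.columns)

# to_frame().T turns the Series into a one-row frame, ready for concat.
with_totals = pd.concat([df, totals.to_frame().T], ignore_index=True)
print(with_totals)
```

Reindexing the Series before building the frame saves the separate `reindex(columns=...)` step on the DataFrame.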


###Sorting

In [37]:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.131006,-0.25936,0.082849,0.715524
2013-01-02,-0.156734,0.254263,0.993153,-0.726136
2013-01-03,-0.663316,-0.989011,0.075403,0.479077
2013-01-04,-1.195212,-0.380858,-1.002542,-0.870212
2013-01-05,1.699788,-0.223632,0.052424,-2.452651
2013-01-06,-1.727329,-1.15599,1.767038,-1.681748


Sort by column index (axis=1)

In [38]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,0.715524,0.082849,-0.25936,-1.131006
2013-01-02,-0.726136,0.993153,0.254263,-0.156734
2013-01-03,0.479077,0.075403,-0.989011,-0.663316
2013-01-04,-0.870212,-1.002542,-0.380858,-1.195212
2013-01-05,-2.452651,0.052424,-0.223632,1.699788
2013-01-06,-1.681748,1.767038,-1.15599,-1.727329


Sort all rows by value in column B

In [39]:
# df.sort(columns='B') in pandas < 0.17; sort_values replaces it
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-06,-1.727329,-1.15599,1.767038,-1.681748
2013-01-03,-0.663316,-0.989011,0.075403,0.479077
2013-01-04,-1.195212,-0.380858,-1.002542,-0.870212
2013-01-01,-1.131006,-0.25936,0.082849,0.715524
2013-01-05,1.699788,-0.223632,0.052424,-2.452651
2013-01-02,-0.156734,0.254263,0.993153,-0.726136
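
Sorting by values also accepts several columns, each with its own direction. A short sketch on an illustrative frame (the data is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': [2, 1, 1, 2]})

# Sort by A ascending, then by B descending within each value of A.
by_two = df.sort_values(by=['A', 'B'], ascending=[True, False])

# Sort by the row index, largest label first.
by_index = df.sort_index(ascending=False)

print(by_two.index.tolist())    # [3, 1, 0, 2]
print(by_index.index.tolist())  # [3, 2, 1, 0]
```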


###Selecting portions

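While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. The older .ix indexer, which appears in the pandas 0.15.2 documentation linked below, was deprecated in pandas 0.20 and removed in 1.0.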

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)
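
As a quick orientation before the detailed cells, here is a minimal sketch of the four accessors side by side. The tiny frame is an assumption for illustration, not the random data used elsewhere in this sheet:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=dates)

row_by_label = df.loc[dates[1]]       # row for 2013-01-02, selected by label
row_by_pos = df.iloc[1]               # the same row, selected by integer position
cell_by_label = df.at[dates[1], 'B']  # single value, label-based
cell_by_pos = df.iat[1, 1]            # single value, position-based

print(cell_by_label, cell_by_pos)  # 5 5
```

`.loc` and `.iloc` return rows, columns or sub-frames; `.at` and `.iat` are restricted to a single cell.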

Selecting a single column, which yields a Series, equivalent to df.A

In [40]:
df['A']

2013-01-01   -1.131006
2013-01-02   -0.156734
2013-01-03   -0.663316
2013-01-04   -1.195212
2013-01-05    1.699788
2013-01-06   -1.727329
Freq: D, Name: A, dtype: float64

In [41]:
np.asarray(df['A'])

array([-1.13100588, -0.15673441, -0.66331627, -1.19521192,  1.69978788,
       -1.72732923])

Selecting via [], which slices the rows.

In [42]:
df[2:4]

Unnamed: 0,A,B,C,D
2013-01-03,-0.663316,-0.989011,0.075403,0.479077
2013-01-04,-1.195212,-0.380858,-1.002542,-0.870212


In [43]:
df['2013-01-01':'2013-01-02']

Unnamed: 0,A,B,C,D
2013-01-01,-1.131006,-0.25936,0.082849,0.715524
2013-01-02,-0.156734,0.254263,0.993153,-0.726136


###[Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

In [44]:
df = makeDateRand()
df.loc[dates[0]]

A   -1.170698
B    0.511826
C    0.984370
D    0.509954
Name: 2013-01-01 00:00:00, dtype: float64

In [45]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,-1.170698,0.511826
2013-01-02,-1.01624,0.761064
2013-01-03,-1.676073,-0.411738
2013-01-04,-1.926114,1.294009
2013-01-05,-0.213809,-1.204703
2013-01-06,0.024314,0.324155


In [46]:
#endpoints included
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,-1.01624,0.761064
2013-01-03,-1.676073,-0.411738
2013-01-04,-1.926114,1.294009


In [47]:
df.loc['20130102',['A','B']]

A   -1.016240
B    0.761064
Name: 2013-01-02 00:00:00, dtype: float64

In [48]:
#locate a row by index label (.ix is gone from modern pandas; .loc is label-based)
df.loc[datetime(2013, 1, 2)]

A   -1.016240
B    0.761064
C   -1.393711
D    0.676826
Name: 2013-01-02 00:00:00, dtype: float64

In [49]:
#locate a row by integer position (irow() was replaced by .iloc)
df.iloc[1]

A   -1.016240
B    0.761064
C   -1.393711
D    0.676826
Name: 2013-01-02 00:00:00, dtype: float64

In [50]:
#drop some of the rows
df = df.drop(df.index[[0,1,2,3]])
for idx, row in df.iterrows():
    print(idx,row)

(Timestamp('2013-01-05 00:00:00', offset='D'), A   -0.213809
B   -1.204703
C   -0.054600
D   -1.209402
Name: 2013-01-05 00:00:00, dtype: float64)
(Timestamp('2013-01-06 00:00:00', offset='D'), A    0.024314
B    0.324155
C   -1.861520
D   -1.363942
Name: 2013-01-06 00:00:00, dtype: float64)


In [51]:
df = makeDateRand()
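for idx, row in df.iterrows():
    # Write through df.loc: `row` may be a copy, and chained indexing such as
    # df.loc[idx]['C'] = 4 may also assign to a temporary object.
    df.loc[idx, 'A'] = 2
    df.loc[idx, 'B'] = 3
    df.loc[idx, 'C'] = 4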
print(df.head())

            A  B  C         D
2013-01-01  2  3  4 -0.104221
2013-01-02  2  3  4 -0.950576
2013-01-03  2  3  4  0.916640
2013-01-04  2  3  4  0.937856
2013-01-05  2  3  4  0.572961
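
When every row gets the same value, the loop above is not needed at all. A sketch of the vectorized equivalent, rebuilding the same kind of random frame so the block runs on its own:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# One assignment per column; pandas broadcasts the scalar to every row.
df['A'] = 2
df['B'] = 3
df['C'] = 4

print(df.head())
```

Column D keeps its random values; only A, B and C are overwritten.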


###[Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

In [52]:
df.iloc[3]

A    2.000000
B    3.000000
C    4.000000
D    0.937856
Name: 2013-01-04 00:00:00, dtype: float64

In [53]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,2,3
2013-01-05,2,3


In [54]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,2,4
2013-01-03,2,4
2013-01-05,2,4


In [55]:
#slicing rows
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,2,3,4,-0.950576
2013-01-03,2,3,4,0.91664


In [56]:
#slicing columns
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,3,4
2013-01-02,3,4
2013-01-03,3,4
2013-01-04,3,4
2013-01-05,3,4
2013-01-06,3,4


In [57]:
df.iloc[1,1]

3.0

###Boolean indexing and filtering

In [58]:
#filter rows by a condition on a single column
df = makeDateRand()
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-01,0.806293,-1.264604,0.184366,-0.435645
2013-01-03,0.339057,-1.960241,-2.204251,0.28978


In [59]:
#filter rows by conditions on multiple columns
df2 = df[(df.A>0) & (df.B>0)]
df2

Unnamed: 0,A,B,C,D


In [60]:
#filter specs are boolean pandas Series (here indexed by date), which can be combined and inspected
filt = (df.A>0) & (df.B>0)
print(type(filt), filt)
print('filt.any() = {}'.format(filt.any()))
print('filt.all() = {}'.format(filt.all()))

(<class 'pandas.core.series.Series'>, 2013-01-01    False
2013-01-02    False
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06    False
Freq: D, dtype: bool)
filt.any() = False
filt.all() = False
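
Boolean masks combine with `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses. A small sketch with fixed, made-up values so the result is predictable:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -1, 2, -2], 'B': [-1, 1, 3, -3]})

both = df[(df.A > 0) & (df.B > 0)]         # rows where both are positive
either = df[(df.A > 0) | (df.B > 0)]       # rows where at least one is positive
neither = df[~((df.A > 0) | (df.B > 0))]   # rows where neither is positive

print(both.index.tolist())     # [2]
print(either.index.tolist())   # [0, 1, 2]
print(neither.index.tolist())  # [3]
```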


In [61]:
#filter by element
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,0.806293,,0.184366,
2013-01-02,,,,
2013-01-03,0.339057,,,0.28978
2013-01-04,,,0.783143,0.160725
2013-01-05,,0.884931,0.962539,
2013-01-06,,,2.330506,1.503962


In [62]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01  0.806293 -1.264604  0.184366 -0.435645    one
2013-01-02 -0.105947 -0.639070 -0.341976 -0.682723    one
2013-01-03  0.339057 -1.960241 -2.204251  0.289780    two
2013-01-04 -1.852649 -1.315759  0.783143  0.160725  three
2013-01-05 -0.448001  0.884931  0.962539 -0.740747   four
2013-01-06 -0.687342 -0.248638  2.330506  1.503962  three


Unnamed: 0,A,B,C,D,E
2013-01-03,0.339057,-1.960241,-2.204251,0.28978,two
2013-01-05,-0.448001,0.884931,0.962539,-0.740747,four


In [63]:
#get unique values in a column
df = makefoobar()
print(df)
df.B.unique()

     A      B         C         D
0  foo    one  1.704886  1.170707
1  bar    one -0.813787 -0.153656
2  foo    two  2.170811 -0.318112
3  bar  three -0.986525  0.078411
4  foo    two  1.045485 -0.289139
5  bar    two -0.121183  0.066849
6  foo    one  0.465065 -0.929122
7  foo  three  1.328987  0.887484


array(['one', 'two', 'three'], dtype=object)

###Setting data

Setting a new column automatically aligns the data by the indexes

In [64]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,0.806293,-1.264604,0.184366,-0.435645,one,
2013-01-02,-0.105947,-0.63907,-0.341976,-0.682723,one,1.0
2013-01-03,0.339057,-1.960241,-2.204251,0.28978,two,2.0
2013-01-04,-1.852649,-1.315759,0.783143,0.160725,three,3.0
2013-01-05,-0.448001,0.884931,0.962539,-0.740747,four,4.0
2013-01-06,-0.687342,-0.248638,2.330506,1.503962,three,5.0


In [65]:
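# Setting values by label.
# Note: df is still the makefoobar() frame from In [63], so dates[0] is not an
# existing label and this assignment appends a new row (last row of the output).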
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D
0,foo,one,1.704886,1.170707
1,bar,one,-0.813787,-0.153656
2,foo,two,2.170811,-0.318112
3,bar,three,-0.986525,0.078411
4,foo,two,1.045485,-0.289139
5,bar,two,-0.121183,0.066849
6,foo,one,0.465065,-0.929122
7,foo,three,1.328987,0.887484
2013-01-01 00:00:00,0,,,


In [66]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D
0,foo,7,1.704886,1.170707
1,bar,one,-0.813787,-0.153656
2,foo,two,2.170811,-0.318112
3,bar,three,-0.986525,0.078411
4,foo,two,1.045485,-0.289139
5,bar,two,-0.121183,0.066849
6,foo,one,0.465065,-0.929122
7,foo,three,1.328987,0.887484
2013-01-01 00:00:00,0,,,


In [67]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D
0,foo,7,1.704886,5
1,bar,one,-0.813787,5
2,foo,two,2.170811,5
3,bar,three,-0.986525,5
4,foo,two,1.045485,5
5,bar,two,-0.121183,5
6,foo,one,0.465065,5
7,foo,three,1.328987,5
2013-01-01 00:00:00,0,,,5


In [68]:
# A where operation with setting.
df = makeDateRand()
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,-0.467656,-0.234445,-0.33796,-1.558072
2013-01-02,-0.960311,-0.434319,-1.64108,-0.333798
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069
2013-01-04,-0.732735,-0.024228,-1.537414,-2.531615
2013-01-05,-0.517563,-0.499382,-0.035378,-0.046663
2013-01-06,-1.134443,-0.309736,-1.180724,-0.518965
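
The masked assignment above can also be written without mutating the frame, using `DataFrame.where`. A minimal sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.5, -2.0], 'B': [-0.5, 3.0]})

# Keep values where the condition holds; elsewhere take the value from -df.
neg = df.where(df <= 0, -df)

print(neg)
```

For this particular case the result is the same as `-df.abs()`; `where` is the more general tool when the replacement is not a simple sign flip.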


##Missing data

pandas primarily uses the value np.nan to represent missing data. By default it is excluded from computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data).

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [69]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1.0
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1.0
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069,
2013-01-04,0.732735,-0.024228,1.537414,2.531615,


Drop any rows that have missing data:

In [70]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1


Filling missing data

In [71]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.467656,-0.234445,0.33796,-1.558072,1
2013-01-02,0.960311,0.434319,1.64108,-0.333798,1
2013-01-03,-0.092155,-0.308327,-0.264266,-1.212069,5
2013-01-04,0.732735,-0.024228,1.537414,2.531615,5


Get the boolean mask of where values are NaN:

In [72]:
pd.isnull(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True
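
`fillna` and `dropna` both take arguments that narrow what they act on. A short sketch, assuming a small made-up frame with gaps in two columns:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'E': [1.0, np.nan, np.nan]})

filled = df1.fillna({'A': 0.0, 'E': -1.0})  # a different fill value per column
kept = df1.dropna(subset=['A'])             # only column A decides which rows go
not_empty = df1.dropna(how='all')           # drop rows where every value is NaN

print(filled['E'].tolist())      # [1.0, -1.0, -1.0]
print(kept.index.tolist())       # [0, 2]
print(not_empty.index.tolist())  # [0, 2]
```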


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [73]:
df.mean()

A    0.291686
B    0.009494
C    0.744715
D   -0.016670
dtype: float64

In [74]:
#along the other axis
df.mean(1)

2013-01-01   -0.480553
2013-01-02    0.675478
2013-01-03   -0.469204
2013-01-04    1.194384
2013-01-05   -0.007366
2013-01-06    0.631099
Freq: D, dtype: float64

###Applying functions to the data

When using apply(), note the axis direction:
- axis=0 (default): the function receives each column in turn, working down the rows
- axis=1: the function receives each row in turn, working across the columns

In [75]:
df = makegridDF()
print(df)
print(df.apply(np.cumsum))
print(df.apply(np.cumsum, axis=0))
print(df.apply(np.cumsum, axis=1))

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A  B   C   D
0  1  5  12  22
1  2  7  15  26
2  3  9  18  30
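
The axis rule is easiest to see with a function that reduces its input to a single number. A sketch on the same grid values as `makegridDF()`, rebuilt here so the block is self-contained:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})

# axis=0: the lambda sees one column at a time, so the result has one value per column.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the lambda sees one row at a time, so the result has one value per row.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())  # [2, 2, 2, 2]
print(row_range.tolist())  # [9, 9, 9]
```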


In [76]:
df = makeDateRand()
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,-1.185344,0.466891,-1.393559,-0.003613
2013-01-02,-0.618737,-0.265284,-1.061658,-0.452598
2013-01-03,0.327723,-0.589845,-0.50787,0.448347
2013-01-04,-0.665326,-0.143526,-1.726438,1.597865
2013-01-05,-0.79325,0.682025,-1.381894,-0.32292
2013-01-06,-0.698137,0.743583,-1.9929,-0.177468


In [77]:
df.apply(lambda x: x.max() - x.min())

A    2.131804
B    1.557725
C    1.947348
D    3.070303
dtype: float64

In [78]:
from datetime import datetime
df = makeDateRand()
df.index.name = 'Date'
df.reset_index(level=0,inplace=True)
print(df)
#convert to string format
df.Date = df.Date.apply(lambda d: ' '.join(d.isoformat().split('T')))
print(df)
#convert back to datetime format
df.Date = df.Date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S"))
print(df)

df.index = df.Date
print(df)

        Date         A         B         C         D
0 2013-01-01  0.990319  0.174850 -0.091437  1.751810
1 2013-01-02  2.123177  0.510229 -1.563843  0.338379
2 2013-01-03 -0.808509  1.184882  2.537792  0.488163
3 2013-01-04  1.323992  1.144447  0.642264 -1.036075
4 2013-01-05  0.834353 -1.300357  0.184578 -1.460907
5 2013-01-06  0.231945  0.842537  0.117507  2.683755
                  Date         A         B         C         D
0  2013-01-01 00:00:00  0.990319  0.174850 -0.091437  1.751810
1  2013-01-02 00:00:00  2.123177  0.510229 -1.563843  0.338379
2  2013-01-03 00:00:00 -0.808509  1.184882  2.537792  0.488163
3  2013-01-04 00:00:00  1.323992  1.144447  0.642264 -1.036075
4  2013-01-05 00:00:00  0.834353 -1.300357  0.184578 -1.460907
5  2013-01-06 00:00:00  0.231945  0.842537  0.117507  2.683755
        Date         A         B         C         D
0 2013-01-01  0.990319  0.174850 -0.091437  1.751810
1 2013-01-02  2.123177  0.510229 -1.563843  0.338379
2 2013-01-03 -0.808509  1.184
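
The element-wise `apply` calls in the cell above can be replaced by vectorized conversions. A minimal sketch using `Series.dt.strftime` and `pd.to_datetime`, on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=3),
                   'A': [1, 2, 3]})

fmt = '%Y-%m-%d %H:%M:%S'
as_text = df['Date'].dt.strftime(fmt)      # datetime -> string
back = pd.to_datetime(as_text, format=fmt)  # string -> datetime

print(as_text[0])  # 2013-01-01 00:00:00
```

Passing an explicit `format` to `pd.to_datetime` avoids format guessing and documents what the strings are expected to look like.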

###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [79]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

6    3
0    3
4    1
3    1
2    1
1    1
dtype: int64

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [80]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [81]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [82]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [83]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.

In [84]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h
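
`str.split` can also build the DataFrame directly through its `expand` argument, which avoids the `apply(pd.Series)` step. A sketch with the same values as `s2` above:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# expand=True returns one column per split piece; missing rows stay empty.
parts = s2.str.split('_', expand=True)

print(parts.shape)  # (4, 3)
```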


Methods like replace and findall take regular expressions, too:

In [85]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
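# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)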

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

###Merge and concat

In [86]:
# Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)

          0         1         2         3
0 -0.312554  1.035760 -0.001916 -0.021936
1 -1.268133  2.296992  1.607537 -0.152936
2 -1.452718  0.265496  0.711312  0.811290
3  1.226806 -0.492936  0.570308 -1.380303
4  0.476827  0.598290 -1.162419 -0.242144
5  0.181001  1.510630  0.789038 -1.992821
6  0.363504 -0.344474  1.802882  0.568088
7  0.452809 -0.495252  1.359618  1.401802
8 -0.043043  0.252168 -0.085674  0.450731
9  0.669227 -1.000953  1.721593 -1.993525


Unnamed: 0,0,1,2,3
0,-0.312554,1.03576,-0.001916,-0.021936
1,-1.268133,2.296992,1.607537,-0.152936
2,-1.452718,0.265496,0.711312,0.81129
3,1.226806,-0.492936,0.570308,-1.380303
4,0.476827,0.59829,-1.162419,-0.242144
5,0.181001,1.51063,0.789038,-1.992821
6,0.363504,-0.344474,1.802882,0.568088
7,0.452809,-0.495252,1.359618,1.401802
8,-0.043043,0.252168,-0.085674,0.450731
9,0.669227,-1.000953,1.721593,-1.993525


In [87]:
# Repeat some rows in a data frame
df2 = pd.concat([df,df[2:4]])
df2

Unnamed: 0,0,1,2,3
0,-0.312554,1.03576,-0.001916,-0.021936
1,-1.268133,2.296992,1.607537,-0.152936
2,-1.452718,0.265496,0.711312,0.81129
3,1.226806,-0.492936,0.570308,-1.380303
4,0.476827,0.59829,-1.162419,-0.242144
5,0.181001,1.51063,0.789038,-1.992821
6,0.363504,-0.344474,1.802882,0.568088
7,0.452809,-0.495252,1.359618,1.401802
8,-0.043043,0.252168,-0.085674,0.450731
9,0.669227,-1.000953,1.721593,-1.993525
2,-1.452718,0.265496,0.711312,0.81129
3,1.226806,-0.492936,0.570308,-1.380303


SQL style merges. See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join)

In [88]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
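
The merge above is an inner join on a key that repeats on both sides, hence the four rows. When keys only partly overlap, the `how` argument controls which rows survive. A sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')  # only keys present in both frames
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# The _merge column added by indicator=True records where each row came from.
origin = dict(zip(outer['key'], outer['_merge'].astype(str)))
print(inner['key'].tolist())  # ['foo']
print(origin)
```

Here `origin` maps `foo` to `both`, `bar` to `left_only` and `baz` to `right_only`; the row order of the outer join may differ between pandas versions.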


Append rows to a dataframe. See the [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation)

In [89]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
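# DataFrame.append was removed in pandas 2.0; turn the row into a one-row frame and concat
pd.concat([df, s.to_frame().T], ignore_index=True)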

Unnamed: 0,A,B,C,D
0,0.026858,0.618968,-0.06791,0.295114
1,-2.727626,1.51913,-1.103879,-0.083556
2,-0.766484,0.329392,1.979625,-0.818243
3,-1.103593,0.1579,-0.594812,0.222186
4,-0.767651,0.71736,-0.713674,-0.729033
5,-0.097077,-0.821013,0.752209,-0.586203
6,-0.247473,-0.526865,-1.967403,0.587469
7,1.124842,-1.09014,-0.5178,-0.172108
8,-1.103593,0.1579,-0.594812,0.222186


###Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)

In [90]:
df = makefoobar()
df

Unnamed: 0,A,B,C,D
0,foo,one,0.44883,1.250155
1,bar,one,0.60904,-0.216908
2,foo,two,-2.42389,-0.558726
3,bar,three,0.820044,1.364242
4,foo,two,-0.384493,0.178097
5,bar,two,0.119828,0.512074
6,foo,one,1.395724,0.145416
7,foo,three,-0.496975,0.298407


Grouping and then applying the sum function to the resulting groups.

In [91]:
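# select the numeric columns explicitly; newer pandas would otherwise also "sum" (concatenate) the strings in B
df.groupby('A')[['C', 'D']].sum()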

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,1.548911,1.659408
foo,-1.460804,1.313348


In [92]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.60904,-0.216908
bar,three,0.820044,1.364242
bar,two,0.119828,0.512074
foo,one,1.844554,1.395571
foo,three,-0.496975,0.298407
foo,two,-2.808382,-0.380629


In [93]:
df = makefoobar()
print(df)
cnts = {}
#first value is the group key, second value is the sub-DataFrame of rows in that group
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()  
cnts

     A      B         C         D
0  foo    one  0.345037  1.120149
1  bar    one -0.983207  1.304603
2  foo    two -0.469320  0.125918
3  bar  three  0.312737  0.146465
4  foo    two  0.590574 -0.566539
5  bar    two  0.666293  0.185597
6  foo    one -0.442139  1.421216
7  foo  three  0.899687  0.275622


{'one': -0.36010299230130288,
 'three': 0.60621232215056042,
 'two': 0.26251537978182976}
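
The GroupBy methods listed in the table at the top (`agg`, `transform`, `filter`) are not demonstrated in this section. A minimal sketch of all three on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
gb = df.groupby('A')

# agg: one row per group, one column per function.
summary = gb['C'].agg(['sum', 'mean'])

# transform: result has the same index as df, so it lines up row by row.
demeaned = df['C'] - gb['C'].transform('mean')

# filter: keep or drop whole groups based on a per-group test.
big = gb.filter(lambda g: g['C'].sum() > 4)

print(summary.loc['foo', 'sum'])   # 4.0
print(demeaned.tolist())           # [-1.0, -1.0, 1.0, 1.0]
print(big['A'].unique().tolist())  # ['bar']
```

The explicit loop over `df.groupby("B")` shown above computes the same thing as `df.groupby('B')['C'].mean()`; the loop form is mainly useful when each group needs custom handling.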

###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [94]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

                     A         B
first second
bar   one     1.016842  1.446916
      two    -1.342349 -0.700087
baz   one     0.811508 -0.058965
      two     0.337480  1.236764


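The stack function “compresses” a level in the DataFrame’s columns into the innermost level of the row index. Here the columns have a single level, so the result is a Series.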

In [95]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    1.016842
               B    1.446916
       two     A   -1.342349
               B   -0.700087
baz    one     A    0.811508
               B   -0.058965
       two     A    0.337480
               B    1.236764
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [96]:
stacked.unstack()

                     A         B
first second
bar   one     1.016842  1.446916
      two    -1.342349 -0.700087
baz   one     0.811508 -0.058965
      two     0.337480  1.236764


In [97]:
stacked.unstack(1)

second        one       two
first
bar   A  1.016842 -1.342349
      B  1.446916 -0.700087
baz   A  0.811508  0.337480
      B -0.058965  1.236764


In [98]:
stacked.unstack(0)

first          bar       baz
second
one    A  1.016842  0.811508
       B  1.446916 -0.058965
two    A -1.342349  0.337480
       B -0.700087  1.236764
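
The level passed to unstack can also be given by name, which tends to be easier to read than a position. The sketch below repeats the round trip on a small integer frame (the values are made up for illustration) so the effect of each call can be checked by eye.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
# Hypothetical integer data, so every cell is easy to trace.
wide = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

stacked = wide.stack()        # Series with a three-level row index
restored = stacked.unstack()  # default: move the last level back to columns

# Unstack a named level instead of the last one.
by_second = stacked.unstack("second")  # same as stacked.unstack(1)
by_first = stacked.unstack("first")    # same as stacked.unstack(0)

print(by_second)
print(by_first)
```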


###Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [99]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

       A  B    C         D         E
0    one  A  foo  0.220966 -0.407925
1    one  B  foo  0.717566  0.335522
2    two  C  foo  0.273564 -0.904677
3  three  A  bar -1.523647 -1.088773
4    one  B  bar -1.980749 -0.375930
5    one  C  bar -0.417124 -0.289410
6    two  A  foo  0.006494 -0.415106
7  three  B  foo -0.785600 -0.970769
8    one  C  foo -0.493092  0.351793
9    one  A  bar -0.058732 -0.292374


In [100]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

C             bar       foo
A     B
one   A -0.058732  0.220966
      B -1.980749  0.717566
      C -0.417124 -0.493092
three A -1.523647       NaN
      B       NaN -0.785600
      C  1.494698       NaN
two   A       NaN  0.006494
      B  0.112476       NaN
      C       NaN  0.273564
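
Two details of pivot_table are easy to miss in the output above: cells with no matching rows are left as NaN, and rows that share the same index and column values are aggregated, by default with the mean. A small sketch with made-up data (the frame orders is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: two rows share the (one, foo) combination.
orders = pd.DataFrame({
    "A": ["one", "one", "one", "two"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation is the mean: the two (one, foo) rows collapse to 2.0.
mean_table = pd.pivot_table(orders, values="D", index="A", columns="C")

# A different aggregation, with empty cells filled instead of left as NaN.
sum_table = pd.pivot_table(
    orders, values="D", index="A", columns="C", aggfunc="sum", fill_value=0
)

print(mean_table)
print(sum_table)
```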


##Categoricals
See the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [101]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!). Then reorder the categories and simultaneously add the missing ones (methods under Series.cat return a new Series by default).

In [102]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]
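
The in-place assignment above reflects pandas 0.15.2. Recent pandas releases no longer accept assignment to Series.cat.categories, so a version-independent way to write the same steps is with rename_categories, which returns a new Series. The sketch below assumes a recent pandas version (the dict form of rename_categories is not available in 0.15.2).

```python
import pandas as pd

grades = pd.Series(["a", "b", "b", "a", "a", "e"]).astype("category")

# rename_categories returns a new Series; the dict maps old names to new ones.
grades = grades.cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"}
)

# set_categories reorders the categories and adds the ones that have no rows yet.
grades = grades.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"], ordered=True
)

print(grades)
```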

In [103]:
# Sorting is per order in the categories, not lexical order.
df.sort("grade")

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
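
DataFrame.sort exists in pandas 0.15.2 but was removed in later releases in favour of sort_values. A minimal sketch of the same idea for newer versions, using a made-up three-row frame:

```python
import pandas as pd

levels = ["very bad", "bad", "medium", "good", "very good"]

# Hypothetical data: an ordered categorical column with a custom order.
report = pd.DataFrame({
    "id": [1, 2, 3],
    "grade": pd.Categorical(
        ["very good", "good", "very bad"], categories=levels, ordered=True
    ),
})

# Rows are ordered by position in levels, not alphabetically.
by_grade = report.sort_values("grade")

print(by_grade)
```

An alphabetical sort would put "good" first; the categorical order puts "very bad" first.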


In [104]:
# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64
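
The NaN counts and float64 dtype above are what pandas 0.15.2 prints. In newer versions, groupby takes an observed argument that controls whether empty categories are included, and as far as I know size then reports them as 0. A sketch under that assumption, on a made-up frame:

```python
import pandas as pd

levels = ["very bad", "bad", "medium", "good", "very good"]

# Hypothetical data: "bad" and "medium" are declared but have no rows.
report = pd.DataFrame({
    "grade": pd.Categorical(
        ["very good", "good", "good", "very good", "very good", "very bad"],
        categories=levels,
    ),
})

# observed=False asks for every declared category, including empty ones.
sizes = report.groupby("grade", observed=False).size()

print(sizes)
```

Passing observed=True instead would drop the empty categories from the result.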

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the numpy convention of raising an error when you try to convert a Series or DataFrame to a single bool. This happens in an if statement and when using the boolean operators and, or, and not.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
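
A short sketch of the gotcha and the usual ways around it. The Series s and the variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series([True, False, True])

message = None
try:
    if s:  # ambiguous: should this mean any element, or all elements?
        pass
except ValueError as err:
    message = str(err)

# State the intended reduction explicitly instead.
has_any = s.any()   # at least one element is True
has_all = s.all()   # every element is True
is_empty = s.empty  # the Series has no elements

# Element-wise logic uses &, | and ~ rather than and, or and not.
both = s & pd.Series([True, True, False])

print(message)
print(has_any, has_all, is_empty)
print(both.tolist())
```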


##Copies and no copies

In [105]:
#to follow

## Python and [module versions, and dates](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb)

In [106]:
%load_ext version_information
%version_information pandas, numpy, scipy, matplotlib, pyradi

|Software|Version|
|--|--|
|Python|2.7.8 32bit [MSC v.1500 32 bit (Intel)]|
|IPython|3.0.0|
|OS|Windows 7 6.1.7601 SP1|
|pandas|0.15.2|
|numpy|1.9.2|
|scipy|0.15.1|
|matplotlib|1.4.3|
|pyradi|0.1.55|

Thu Apr 09 20:22:53 2015 South Africa Standard Time
